--- title: "PwC Case: Auto Insurance Analysis" date: 2020-06-06 categories: - Data Analytics ---
In [1]:
import pandas as pd
import numpy as np
from datetime import timedelta
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import statsmodels.formula.api as smf
In [2]:
# Pandas settings
pd.options.display.max_columns = None

# Seaborn styles
sns.set(
    style='white',
    font_scale=1.5,
    rc={
        'figure.figsize': (14, 8)
    }
)

Pre-define cell styles

In [3]:
new_column_style = lambda _: 'background-color: greenyellow'
highlight_style = lambda _: 'background-color: gold'
loss_ratio_style = lambda _: '{ color: indianred }'

Notes on this Jupyter notebook

  • Newly merged columns are marked with light green backgrounds.
  • Highlighted cells are marked with gold backgrounds.
  • Net Premium refers to Total Premium - Refunded Premium.
    For example, if a driver has paid \$1,000 to purchase a policy and was refunded \\$400 after cancelling the policy, the net premium is \$600.
  • seaborn and plotly packages are used for visualizations. To run the notebook, these packages should be installed.
    Installing seaborn should be straightforward (pip install seaborn or conda install seaborn).
    For plotly, please refer to https://plotly.com/python/getting-started/

Import/clean data

Read CSV files

Each sheet from the Excel file has been separated into separate .csv files. While reading csv files, I'll parse datetime type columns when reading csv.

In [4]:
df_agency_dimension = pd.read_csv('data/agency_dimension.csv')
df_claims_fact = pd.read_csv('data/claims_fact.csv')
df_claim_branches = pd.read_csv('data/claim_branches.csv')
df_claim_expense_dimension = pd.read_csv('data/claim_expense_dimension.csv')
df_communication_data = pd.read_csv('data/communication_data.csv')
df_company_dimension = pd.read_csv('data/company_dimension.csv')
df_driver_dimension = pd.read_csv('data/driver_dimension.csv')
df_location_dimension = pd.read_csv('data/location_dimension.csv')
df_policy_dimension = pd.read_csv('data/policy_dimension.csv', 
                                  parse_dates=['Policy Effective Date', 'Policy Expiration Date', 'Cancel Date'])
df_population_distribution = pd.read_csv('data/population_distribution.csv')
df_risk_fact = pd.read_csv('data/risk_fact.csv')
df_vehicles_dimension = pd.read_csv('data/vehicles_dimension.csv')

Exploring data

Is there anyone who purchased multiple policies?

I check whether a driver has purchased more than one policy.

In [5]:
# Number of unique drivers in risk_fact
df_risk_fact['Driver ID'].nunique()
Out[5]:
35712
In [6]:
df_risk_fact[df_risk_fact['Total Premium'] > 0].shape[0]
Out[6]:
35712
Conclusion No, each driver has purchased only 1 policy.

Pre-processing

Risk Fact

Reconciling inconsistent Vehicle IDs (for same drivers) between Risk Fact and Claims Fact dataframes

Case assumption #9 mentions that some drivers have different Vehicle IDs in risk fact and claim fact tables. Use claims fact table as the source of truth.

In [7]:
df_vehicle_by_driver_risk_fact = df_risk_fact[['Driver ID', 'Vehicle ID']].drop_duplicates()
df_vehicle_by_driver_risk_fact.head(3)
Out[7]:
Driver ID Vehicle ID
0 4758 50
1 5901 68
2 12972 43
In [8]:
df_vehicle_by_driver_claims_fact = df_claims_fact[['Claimant Id', 'Vehicle ID']].copy() \
    .drop_duplicates() \
    .rename(columns={
        'Claimant Id': 'Driver ID'
    })

df_vehicle_by_driver_claims_fact.head(3)
Out[8]:
Driver ID Vehicle ID
0 28498 78
1 5163 27
2 12472 17
In [9]:
df_vehicle_comparison = df_vehicle_by_driver_risk_fact.merge(right=df_vehicle_by_driver_claims_fact, on='Driver ID', how='left') \
    .rename(columns={
        'Vehicle ID_x': 'Vehicle ID Risk Fact', 
        'Vehicle ID_y': 'Vehicle ID Claims Fact'
    })

df_vehicle_comparison['Vehicle ID Claims Fact'] = df_vehicle_comparison['Vehicle ID Claims Fact'].astype('Int64')
df_vehicle_comparison = df_vehicle_comparison[df_vehicle_comparison['Vehicle ID Claims Fact'].notna()]

df_vehicle_comparison.head(3)
Out[9]:
Driver ID Vehicle ID Risk Fact Vehicle ID Claims Fact
7 25890 10 10
36 10613 42 31
131 6047 81 33
In [10]:
df_different_vehicles = df_vehicle_comparison[df_vehicle_comparison['Vehicle ID Risk Fact'] != df_vehicle_comparison['Vehicle ID Claims Fact']]
display(
    df_different_vehicles.head() \
        .style \
        .applymap(highlight_style, subset=['Vehicle ID Risk Fact', 'Vehicle ID Claims Fact'])
)

print(f'{df_different_vehicles.shape[0]} drivers have different vehicle IDs in risk fact and claims fact tables.')
Driver ID Vehicle ID Risk Fact Vehicle ID Claims Fact
36 10613 42 31
131 6047 81 33
166 9516 75 60
191 11894 49 43
245 17467 72 78
568 drivers have different vehicle IDs in risk fact and claims fact tables.

Create a new risk fact dataframe with updated vehicle IDs

In [11]:
df_risk_fact_vehicle_id_fixed = df_risk_fact.merge(right=df_different_vehicles[['Driver ID', 'Vehicle ID Claims Fact']], on='Driver ID', how='left')
df_risk_fact_vehicle_id_fixed['Vehicle ID'] = np.where(df_risk_fact_vehicle_id_fixed['Vehicle ID Claims Fact'].isna(), df_risk_fact_vehicle_id_fixed['Vehicle ID'], df_risk_fact_vehicle_id_fixed['Vehicle ID Claims Fact'])

display(
    df_risk_fact_vehicle_id_fixed[df_risk_fact_vehicle_id_fixed['Vehicle ID Claims Fact'].notna()] \
        .head() \
        .style \
        .applymap(highlight_style, subset=['Vehicle ID', 'Vehicle ID Claims Fact'])
)

df_risk_fact_vehicle_id_fixed.drop(columns=['Vehicle ID Claims Fact'], inplace=True)
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Vehicle ID Claims Fact
36 37 1007613 10613 31 2006 335 3090.57 31
131 132 1003047 6047 33 2007 162 2970.57 33
166 167 1006516 9516 60 2007 590 2934.57 60
191 192 1008894 11894 43 2006 529 3066.57 43
245 246 1014467 17467 78 2005 485 4566.57 78

Driver Dimension

Parse `Date of Birth" as a datetime format

Most date of birth values are in YYYY-MM-DD hh:mm:ss format. There are 2162 rows that have an invalid format (whitespace strings).

In [12]:
# Using errors='coerce' will force invalid formats into NaT
df_driver_dimension['Date of Birth'] = pd.to_datetime(df_driver_dimension['Date of Birth'], errors='coerce')
df_driver_dimension.head(3).style.applymap(new_column_style, subset='Date of Birth')
Out[12]:
Driver ID First Name Last Name Number of Violations Number of Accidents Marital Status Gender Miles to work Primary Vehicle Use Date of Birth
0 1 Duane Crance 3 0 M M 9 Work 1952-04-16 00:00:00
1 2 Sheldon Sherbert 1 0 S M 8 Leisure 1971-11-15 00:00:00
2 3 Kristian Brix 0 0 M M 40 Work 1989-01-13 00:00:00
In [13]:
print(f'Column data type of "Date of Birth": {df_driver_dimension["Date of Birth"].dtype}')
print(f'Number of missing (NaT) "Date of Birth": {df_driver_dimension["Date of Birth"].isna().sum()}')
Column data type of "Date of Birth": datetime64[ns]
Number of missing (NaT) "Date of Birth": 2162
In [14]:
df_driver_dimension[df_driver_dimension['Date of Birth'].isna()] \
    .head(3) \
    .style \
    .applymap(highlight_style, subset='Date of Birth')
Out[14]:
Driver ID First Name Last Name Number of Violations Number of Accidents Marital Status Gender Miles to work Primary Vehicle Use Date of Birth
13 14 Tammara Swimm 1 1 M M nan Leisure NaT
14 15 Gina Creek 6 3 S F nan Work NaT
135 136 Noel Vanderwood 3 1 S M nan Leisure NaT

We will have to take care of the missing birthdays later.

Policy Dimension

Add annual/semiannual classification to the policy dimension

What is the duration of policies? Get a count of unique values.

In [15]:
df_policy_duration = df_policy_dimension['Policy Expiration Date'] - df_policy_dimension['Policy Effective Date']
df_policy_duration.value_counts()
Out[15]:
365 days    9999
184 days    5033
366 days    4967
182 days    3365
183 days    3360
181 days    3276
dtype: int64

All policies are either semiannual or annual. Extract this information to a new column.

In [16]:
df_policy_dimension['Duration'] = np.where(df_policy_duration > timedelta(days=200), 'Annual', 'Semiannual')

df_policy_dimension.head(3).style.applymap(new_column_style, subset=['Duration'])
Out[16]:
Policy Number Policy Effective Date Policy Expiration Date Agency ID Underwriting Company ID Cancel Date Duration
0 1000001 2014-03-09 00:00:00 2015-03-09 00:00:00 11 4 NaT Annual
1 1000002 2011-04-02 00:00:00 2012-04-02 00:00:00 15 2 NaT Annual
2 1000003 2012-01-29 00:00:00 2012-07-31 00:00:00 3 7 NaT Semiannual
In [17]:
df_policy_dimension['Duration'].value_counts()
Out[17]:
Semiannual    15034
Annual        14966
Name: Duration, dtype: int64

There seems to be an even split between semiannual and annual policies.

Creating merged master dataframes for easier analysis

Risk Fact (one row per driver)

For the risk fact table, there are 2 types of rows - purchase and refund. A positive premium indicates a purhcase of policy, while a negative premium indicates a refund (cancellation).

Let's merge purchase/refund rows into single rows, creating "Refund Amount" and "Net Premium" columns. The end goal is to have 1 row per driver.

In [18]:
# Create a dataframe of driver vs refund amount information
# Note that there is at maximum 1 refund row per driver
# If any driver has more than 1 refund, we will need to groupby "Driver ID" to sum the refund amounts
df_refund_by_driver = df_risk_fact_vehicle_id_fixed[df_risk_fact_vehicle_id_fixed['Total Premium'] < 0][['Driver ID', 'Total Premium']]
df_refund_by_driver.rename(columns={ 'Total Premium': 'Refund Amount' }, inplace=True)
df_refund_by_driver.head()
Out[18]:
Driver ID Refund Amount
35712 4758 -1695.029100
35713 7827 -414.218141
35714 11385 -790.590744
35715 21096 -425.856497
35716 21839 -144.661977
In [19]:
# Add refund amount to purchase rows (where Total Premium > 0)
df_risk_master = df_risk_fact_vehicle_id_fixed[df_risk_fact_vehicle_id_fixed['Total Premium'] > 0].merge(right=df_refund_by_driver, on='Driver ID', how='left')

# Add "Net Premium" column (Total Premium + Refund Amount)
# Treat np.nan as a 0 when adding two numbers
df_risk_master['Net Premium'] = df_risk_master['Total Premium'].add(df_risk_master['Refund Amount'], fill_value=0)

# Display added column
df_risk_master.head(3).style.applymap(new_column_style, subset=['Net Premium'])
Out[19]:
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Refund Amount Net Premium
0 1 1001758 4758 50 2006 65 3450.57 -1695.03 1755.54
1 2 1002901 5901 68 2013 536 1938.57 nan 1938.57
2 3 1009972 12972 43 2005 3174.57 nan 3174.57

Merge policy dimension to risk fact

In [20]:
df_risk_master = df_risk_master.merge(right=df_policy_dimension, how='inner', on='Policy Number').sort_values(by=['Policy Number', 'Driver ID'])

df_risk_master.head(3).style.applymap(new_column_style, 
                                            subset=df_policy_dimension.columns.to_series().drop('Policy Number'))
Out[20]:
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Refund Amount Net Premium Policy Effective Date Policy Expiration Date Agency ID Underwriting Company ID Cancel Date Duration
26 11 1000001 3001 34 2011 229 4434.57 nan 4434.57 2014-03-09 00:00:00 2015-03-09 00:00:00 11 4 NaT Annual
28353 22644 1000002 3002 61 2008 401 4795.82 nan 4795.82 2011-04-02 00:00:00 2012-04-02 00:00:00 15 2 NaT Annual
25958 20253 1000003 3003 61 2011 576 1325.81 nan 1325.81 2012-01-29 00:00:00 2012-07-31 00:00:00 3 7 NaT Semiannual

Merge driver dimension to risk fact

In [21]:
df_risk_master = df_risk_master.merge(right=df_driver_dimension, on='Driver ID')
df_risk_master.head(3).style.applymap(new_column_style,
                                           subset=df_driver_dimension.columns.to_series().drop('Driver ID'))
Out[21]:
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Refund Amount Net Premium Policy Effective Date Policy Expiration Date Agency ID Underwriting Company ID Cancel Date Duration First Name Last Name Number of Violations Number of Accidents Marital Status Gender Miles to work Primary Vehicle Use Date of Birth
0 11 1000001 3001 34 2011 229 4434.57 nan 4434.57 2014-03-09 00:00:00 2015-03-09 00:00:00 11 4 NaT Annual Josefa Turnbill 6 2 S M 28 Leisure 1994-02-28 00:00:00
1 22644 1000002 3002 61 2008 401 4795.82 nan 4795.82 2011-04-02 00:00:00 2012-04-02 00:00:00 15 2 NaT Annual Deidre Whilden 6 2 S M 34 Work 1965-07-04 00:00:00
2 20253 1000003 3003 61 2011 576 1325.81 nan 1325.81 2012-01-29 00:00:00 2012-07-31 00:00:00 3 7 NaT Semiannual Fernande Lashbrook 2 0 S F 19 Leisure 1973-09-19 00:00:00

Add age of driver based on date of birth and policy effective date

In [22]:
df_risk_master['Age'] = (df_risk_master['Policy Effective Date'].dt.year - df_risk_master['Date of Birth'].dt.year) - 1

# Increase age if effective date has passed the date of birth for that year
df_risk_master['Age'] = df_risk_master['Age'] + (df_risk_master['Policy Effective Date'].dt.strftime('%m%d') > df_risk_master['Date of Birth'].dt.strftime('%m%d'))

# Convert to integer type
df_risk_master['Age'] = df_risk_master['Age'].astype('Int64')

# Check results
df_risk_master[['Policy Effective Date', 'Date of Birth', 'Age']].head(3).style.applymap(new_column_style, subset=['Age'])
Out[22]:
Policy Effective Date Date of Birth Age
0 2014-03-09 00:00:00 1994-02-28 00:00:00 20
1 2011-04-02 00:00:00 1965-07-04 00:00:00 45
2 2012-01-29 00:00:00 1973-09-19 00:00:00 38

Create age bins

In [23]:
df_risk_master['Age Bin'] = pd.cut(df_risk_master['Age'],
                     bins=[9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, np.Infinity],
                     labels=['10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+'],
                     include_lowest=True,
                     right=False,
                    )

df_risk_master[['Driver ID', 'Policy Effective Date', 'Date of Birth', 'Age', 'Age Bin']].head(3) \
    .style \
    .applymap(new_column_style, subset='Age Bin')
Out[23]:
Driver ID Policy Effective Date Date of Birth Age Age Bin
0 3001 2014-03-09 00:00:00 1994-02-28 00:00:00 20 20-29
1 3002 2011-04-02 00:00:00 1965-07-04 00:00:00 45 40-49
2 3003 2012-01-29 00:00:00 1973-09-19 00:00:00 38 30-39

Calculate monthly premium based on semiannual/annual classification we've done in policy dimension

In [24]:
df_risk_master['Monthly Premium'] = np.where(df_risk_master['Duration'] == 'Annual',
                                             df_risk_master['Total Premium'] / 12,
                                             df_risk_master['Total Premium'] / 6)

df_risk_master[['Policy Number', 'Driver ID', 'Total Premium', 'Duration', 'Monthly Premium']] \
    .head(3) \
    .style \
    .applymap(new_column_style, subset=['Monthly Premium'])
Out[24]:
Policy Number Driver ID Total Premium Duration Monthly Premium
0 1000001 3001 4434.57 Annual 369.548
1 1000002 3002 4795.82 Annual 399.652
2 1000003 3003 1325.81 Semiannual 220.969

Merge Claim Fact dataframe with Claim Expense Dimension.

In [25]:
df_claims_fact_merged = df_claims_fact.merge(right=df_claim_expense_dimension, on='Claim ID', how='left')
df_claims_fact_merged.head(3).style.applymap(new_column_style, subset=df_claim_expense_dimension.columns.to_series().drop('Claim ID'))
Out[25]:
Claim ID Policy ID Claimant Id Vehicle ID Claim Amount Claim Description Expense Amount Claim Handling Branch
0 1 1025498 28498 78 4130 Damage from impacts with animals (deer, etc.); note: a crash from swerving to avoid animals will likely fall under collision 165.2 3
1 2 1002163 5163 27 13340 Personal Injury 1200.6 2
2 3 1009472 12472 17 2100 Bodily Injury Claims 147 2

For each driver, calculate (1) sum of all claim amounts, (2) sum of all claim expenses, (3) number of claims. This step is to aggregate different claims by a single driver into one row.

In [26]:
df_claims_by_driver = df_claims_fact_merged.groupby(by='Claimant Id', as_index=False) \
    .agg({
        'Claim Description': 'count',
        'Claim Amount': 'sum', 
        'Expense Amount': 'sum'
    }) \
    .rename(columns={
        'Claim Description': 'Claim Count',
        'Claimant Id': 'Driver ID',
        'Expense Amount': 'Claim Expense'
    })

df_claims_by_driver.head(3)
Out[26]:
Driver ID Claim Count Claim Amount Claim Expense
0 1146 1 7470 1419.3
1 1328 1 6400 1088.0
2 3112 1 11180 670.8
In [27]:
# Merge both policy_dimension and df_claims_by_driver into risk_fact
df_risk_master = df_risk_master.merge(right=df_claims_by_driver, on='Driver ID', how='left')

# Convert claim count to an integer type
df_risk_master['Claim Count'] = df_risk_master['Claim Count'].astype('Int64')

df_risk_master.head(3).style.applymap(new_column_style,
                                      subset=df_claims_by_driver.columns.to_series().drop('Driver ID'))
Out[27]:
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Refund Amount Net Premium Policy Effective Date Policy Expiration Date Agency ID Underwriting Company ID Cancel Date Duration First Name Last Name Number of Violations Number of Accidents Marital Status Gender Miles to work Primary Vehicle Use Date of Birth Age Age Bin Monthly Premium Claim Count Claim Amount Claim Expense
0 11 1000001 3001 34 2011 229 4434.57 nan 4434.57 2014-03-09 00:00:00 2015-03-09 00:00:00 11 4 NaT Annual Josefa Turnbill 6 2 S M 28 Leisure 1994-02-28 00:00:00 20 20-29 369.548 nan nan nan
1 22644 1000002 3002 61 2008 401 4795.82 nan 4795.82 2011-04-02 00:00:00 2012-04-02 00:00:00 15 2 NaT Annual Deidre Whilden 6 2 S M 34 Work 1965-07-04 00:00:00 45 40-49 399.652 nan nan nan
2 20253 1000003 3003 61 2011 576 1325.81 nan 1325.81 2012-01-29 00:00:00 2012-07-31 00:00:00 3 7 NaT Semiannual Fernande Lashbrook 2 0 S F 19 Leisure 1973-09-19 00:00:00 38 30-39 220.969 nan nan nan

Merge vehicles dimension into risk fact

In [28]:
df_vehicles_dimension.head(3)
Out[28]:
Vehicle ID Car Make Model Average Price
0 1 Nissan Versa 11000
1 2 Chevrolet Spark 13000
2 3 Chevrolet Sonic 15000
In [29]:
df_risk_master = df_risk_master.merge(right=df_vehicles_dimension, on='Vehicle ID', how='left').rename(columns={
    'Car Make': 'Make',
    'Average Price': 'Vehicle Average Price'
})

df_risk_master[['Policy Number', 'Driver ID', 'Make', 'Model', 'Vehicle Average Price']] \
    .head(3) \
    .style \
    .applymap(new_column_style, subset=['Make', 'Model', 'Vehicle Average Price'])
Out[29]:
Policy Number Driver ID Make Model Vehicle Average Price
0 1000001 3001 Dodge Challenger 26600
1 1000002 3002 Audi A3 SportBack 38900
2 1000003 3003 Audi A3 SportBack 38900

Add a column to indicate whether a policy was cancelled

In [30]:
df_risk_master['Cancelled'] = df_risk_master['Cancel Date'].notna()
df_risk_master[['Policy Number', 'Driver ID', 'Cancel Date', 'Cancelled']] \
    .head(3) \
    .style \
    .applymap(new_column_style, subset=['Cancelled'])
Out[30]:
Policy Number Driver ID Cancel Date Cancelled
0 1000001 3001 NaT False
1 1000002 3002 NaT False
2 1000003 3003 NaT False

Merge location dimension into risk fact

In [31]:
print('Before converting column data types')
print(f"df_risk_master['Location ID'].dtype: {df_risk_master['Location ID'].dtype}")
print(f"df_location_dimension['Location ID'].dtype: {df_location_dimension['Location ID'].dtype}")
Before converting column data types
df_risk_master['Location ID'].dtype: object
df_location_dimension['Location ID'].dtype: int64

However, the dtypes of df_risk_fact_cleaned and df_location_dimension are different as shown above. This will result in an error when merging the two dataframes by Location ID. Cast both to nullable integer types (Int64) first.

In [32]:
df_risk_master['Location ID'] = pd.to_numeric(df_risk_fact['Location ID'], errors='coerce').astype('Int64')
df_location_dimension['Location ID'] = df_location_dimension['Location ID'].astype('Int64')
print('After converting column data types')
print(f"df_risk_master['Location ID'].dtype: {df_risk_master['Location ID'].dtype}")
print(f"df_location_dimension['Location ID'].dtype: {df_location_dimension['Location ID'].dtype}")
After converting column data types
df_risk_master['Location ID'].dtype: Int64
df_location_dimension['Location ID'].dtype: Int64

Now, we should be able to merge.

In [33]:
df_risk_master = df_risk_master.merge(df_location_dimension, on='Location ID', how='left')
df_risk_master \
    .head(3)\
    .style \
    .applymap(new_column_style, subset=['ZIP', 'Primary_City', 'State', 'County', 'Estimated_population_2013'])
Out[33]:
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Refund Amount Net Premium Policy Effective Date Policy Expiration Date Agency ID Underwriting Company ID Cancel Date Duration First Name Last Name Number of Violations Number of Accidents Marital Status Gender Miles to work Primary Vehicle Use Date of Birth Age Age Bin Monthly Premium Claim Count Claim Amount Claim Expense Make Model Vehicle Average Price Cancelled ZIP Primary_City State County Estimated_population_2013
0 11 1000001 3001 34 2011 65 4434.57 nan 4434.57 2014-03-09 00:00:00 2015-03-09 00:00:00 11 4 NaT Annual Josefa Turnbill 6 2 S M 28 Leisure 1994-02-28 00:00:00 20 20-29 369.548 nan nan nan Dodge Challenger 26600 False 92630 Lake Forest CA Orange County 56565
1 22644 1000002 3002 61 2008 536 4795.82 nan 4795.82 2011-04-02 00:00:00 2012-04-02 00:00:00 15 2 NaT Annual Deidre Whilden 6 2 S M 34 Work 1965-07-04 00:00:00 45 40-49 399.652 nan nan nan Audi A3 SportBack 38900 False 90222 Compton CA Los Angeles County 26204
2 20253 1000003 3003 61 2011 nan 1325.81 nan 1325.81 2012-01-29 00:00:00 2012-07-31 00:00:00 3 7 NaT Semiannual Fernande Lashbrook 2 0 S F 19 Leisure 1973-09-19 00:00:00 38 30-39 220.969 nan nan nan Audi A3 SportBack 38900 False nan nan nan nan nan

Keep in mind that there is one row where the Location ID is missing.

In [34]:
df_risk_master[df_risk_master['Location ID'].isna()]
Out[34]:
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Refund Amount Net Premium Policy Effective Date Policy Expiration Date Agency ID Underwriting Company ID Cancel Date Duration First Name Last Name Number of Violations Number of Accidents Marital Status Gender Miles to work Primary Vehicle Use Date of Birth Age Age Bin Monthly Premium Claim Count Claim Amount Claim Expense Make Model Vehicle Average Price Cancelled ZIP Primary_City State County Estimated_population_2013
2 20253 1000003 3003 61 2011 NaN 1325.814736 NaN 1325.814736 2012-01-29 2012-07-31 3 7 NaT Semiannual Fernande Lashbrook 2 0 S F 19.0 Leisure 1973-09-19 38 30-39 220.969123 NaN NaN NaN Audi A3 SportBack 38900 False NaN NaN NaN NaN NaN
In [35]:
df_risk_master.rename(columns={
    'Primary_City': 'City', 
    'Estimated_population_2013': 'Estimated Population'
}, inplace=True) \

df_risk_master.head(3)
Out[35]:
Record ID Policy Number Driver ID Vehicle ID Vehicle Model Year Location ID Total Premium Refund Amount Net Premium Policy Effective Date Policy Expiration Date Agency ID Underwriting Company ID Cancel Date Duration First Name Last Name Number of Violations Number of Accidents Marital Status Gender Miles to work Primary Vehicle Use Date of Birth Age Age Bin Monthly Premium Claim Count Claim Amount Claim Expense Make Model Vehicle Average Price Cancelled ZIP City State County Estimated Population
0 11 1000001 3001 34 2011 65 4434.570900 NaN 4434.570900 2014-03-09 2015-03-09 11 4 NaT Annual Josefa Turnbill 6 2 S M 28.0 Leisure 1994-02-28 20 20-29 369.547575 NaN NaN NaN Dodge Challenger 26600 False 92630.0 Lake Forest CA Orange County 56565.0
1 22644 1000002 3002 61 2008 536 4795.820215 NaN 4795.820215 2011-04-02 2012-04-02 15 2 NaT Annual Deidre Whilden 6 2 S M 34.0 Work 1965-07-04 45 40-49 399.651685 NaN NaN NaN Audi A3 SportBack 38900 False 90222.0 Compton CA Los Angeles County 26204.0
2 20253 1000003 3003 61 2011 NaN 1325.814736 NaN 1325.814736 2012-01-29 2012-07-31 3 7 NaT Semiannual Fernande Lashbrook 2 0 S F 19.0 Leisure 1973-09-19 38 30-39 220.969123 NaN NaN NaN Audi A3 SportBack 38900 False NaN NaN NaN NaN NaN

Claims Fact (one row per claim)

Create a master claims fact dataframe.

In [36]:
df_claims_master = df_claims_fact_merged \
    .merge(
        right=df_claim_branches, 
        left_on='Claim Handling Branch',
        right_on='Branch Id',
        how='left'
    ) \
    .drop(columns=['Claim Handling Branch']) \
    .rename(columns={ 
        'Expense Amount': 'Claim Expense',
        'Branch Id': 'Branch ID',
        'Claimant Id': 'Driver ID'
    })

df_claims_master.head(3)
Out[36]:
Claim ID Policy ID Driver ID Vehicle ID Claim Amount Claim Description Claim Expense Branch ID Branch Name
0 1 1025498 28498 78 4130 Damage from impacts with animals (deer, etc.);... 165.2 3 Norwak CA Branch
1 2 1002163 5163 27 13340 Personal Injury 1200.6 2 Watsonville CA Branch
2 3 1009472 12472 17 2100 Bodily Injury Claims 147.0 2 Watsonville CA Branch

Potential erroneous parts of data

Remember, the data was artificially created for this project. This section explores any odd parts that seem unrealistic.

Can a driver have a refund amount larger than the paid amount?

Check for drivers where the sum of premium grouped by that driver ID is negative.

In [37]:
premiums = df_risk_master.groupby('Driver ID')['Net Premium'].sum().to_frame().reset_index()
negative_premiums = premiums[premiums['Net Premium'] < 0]

display(negative_premiums.head().style.applymap(highlight_style, subset=['Net Premium']))
print(f'{negative_premiums.shape[0]} drivers have negative sum of premiums.')
Driver ID Net Premium
252 3248 -94.8582
679 3675 -9.21436
740 3736 -10.6938
831 3827 -95.2198
1299 4295 -31.1103
90 drivers have negative sum of premiums.

There are policies that have refund amounts larger than the payment amount.

In [38]:
# Split the dataframe by whether net premium is positive negative
df_risk_fact_positive_premiums = df_risk_master[df_risk_master['Net Premium'] > 0]
df_risk_fact_negative_premiums = df_risk_master[df_risk_master['Net Premium'] < 0]

df_risk_fact_negative_premiums[['Policy Number', 'Driver ID', 'Total Premium', 'Refund Amount', 'Net Premium', 'Duration']] \
    .head() \
    .style \
    .applymap(highlight_style, subset=['Duration'])
Out[38]:
Policy Number Driver ID Total Premium Refund Amount Net Premium Duration
293 1000248 3248 2142.57 -2237.43 -94.8582 Semiannual
787 1000675 3675 1579.89 -1589.1 -9.21436 Semiannual
863 1000736 3736 1912.6 -1923.29 -10.6938 Semiannual
967 1000827 3827 999.842 -1095.06 -95.2198 Semiannual
1517 1001295 4295 1688.77 -1719.88 -31.1103 Semiannual

These negative net premium rows seem to have semianuual duration. Let's look at the exact proportions.

In [39]:
fig = make_subplots(1, 2, specs=[[{'type': 'domain'}, {'type': 'domain'}]], subplot_titles=['Positive Net Premiums', 'Negative Net Premiums'])
pie_labels = ['Annual', 'Semiannual']

fig.add_trace(
    go.Pie(
        labels=pie_labels,
        values=df_risk_fact_positive_premiums['Duration'].value_counts().sort_index(),
        text=pie_labels,
        textposition='inside',
        insidetextorientation='horizontal',
        pull=[0.0, 0, 0.1, 0],
        sort=False,
        hole=0.3
    ), 1, 1
)

fig.add_trace(
    go.Pie(
        labels=pie_labels,
        values=df_risk_fact_negative_premiums['Duration'].value_counts().sort_index(),
        text=pie_labels,
        textposition='inside',
        insidetextorientation='horizontal',
        pull=[0.0, 0, 0.1, 0],
        sort=False,
        hole=0.3,
    ), 1, 2
)

fig.update_layout(title_text='Durations by Positive/Negative Net Premiums')
fig.show()
Conclusion 90 drivers have refunds larger than the payment amount, resulting in a negative net premium. Negative net premium policies are mostly semiannual policies.

Does any policy that has not been canceled have a refund issued?

In [40]:
df_refund_without_cancellation = df_risk_master[(df_risk_master['Refund Amount'].notna()) & (df_risk_master['Cancel Date'].isna())]
df_refund_without_cancellation = df_refund_without_cancellation[['Policy Number',
                                                                 'Driver ID',
                                                                 'Total Premium',
                                                                 'Refund Amount',
                                                                 'Policy Effective Date',
                                                                 'Policy Expiration Date',
                                                                 'Cancelled',
                                                                 'Cancel Date',
                                                                 'Duration']]

df_refund_without_cancellation \
    .head() \
    .style \
    .applymap(highlight_style, subset=['Refund Amount', 'Cancelled'])
Out[40]:
Policy Number Driver ID Total Premium Refund Amount Policy Effective Date Policy Expiration Date Cancelled Cancel Date Duration
15 1000014 2218629 3678.57 -2660.72 2013-02-23 00:00:00 2014-02-23 00:00:00 False NaT Annual
164 1000141 2217066 1684.17 -1589.43 2013-05-21 00:00:00 2013-11-18 00:00:00 False NaT Semiannual
311 1000262 2217302 1519.39 -1079.18 2014-01-25 00:00:00 2014-07-28 00:00:00 False NaT Semiannual
392 1000331 48243 910.582 -676.574 2013-02-07 00:00:00 2013-08-07 00:00:00 False NaT Semiannual
496 1000419 49350 2910.57 -2660.12 2012-01-28 00:00:00 2013-01-27 00:00:00 False NaT Annual

Find out how much percentage of the original premium has been refunded for each driver.

In [41]:
df_refund_without_cancellation['Refund Percentage'] = -df_refund_without_cancellation['Refund Amount'] / df_refund_without_cancellation['Total Premium']
df_refund_without_cancellation \
    .head() \
    .style \
    .applymap(highlight_style, subset='Refund Percentage')
Out[41]:
Policy Number Driver ID Total Premium Refund Amount Policy Effective Date Policy Expiration Date Cancelled Cancel Date Duration Refund Percentage
15 1000014 2218629 3678.57 -2660.72 2013-02-23 00:00:00 2014-02-23 00:00:00 False NaT Annual 0.723302
164 1000141 2217066 1684.17 -1589.43 2013-05-21 00:00:00 2013-11-18 00:00:00 False NaT Semiannual 0.943746
311 1000262 2217302 1519.39 -1079.18 2014-01-25 00:00:00 2014-07-28 00:00:00 False NaT Semiannual 0.710272
392 1000331 48243 910.582 -676.574 2013-02-07 00:00:00 2013-08-07 00:00:00 False NaT Semiannual 0.743013
496 1000419 49350 2910.57 -2660.12 2012-01-28 00:00:00 2013-01-27 00:00:00 False NaT Annual 0.913953
In [42]:
print(f'There are {df_refund_without_cancellation.shape[0]} refunds issued without a cancellation.')
There are 351 refunds issued without a cancellation.
Conclusion
These may indicate partial refunds for premium adjustments. For example, a driver may reduce the amount of deductibles to lower the premium. However, that doesn't seem to be the case. On average, 78% of the total premium was refunded for these drivers. That is an extraordinarily large proportion for premium adjustments.

One possibility is that the cancellation dates for these rows were omitted.

A driver must be at least 15.5 years old to apply for a learner's permit in California. That means that anyone under 15 can't purchase an auto insurance policy.

In [43]:
df_driver_underage = df_risk_master[df_risk_master['Age'] < 15]

df_driver_underage = df_driver_underage[['Policy Number', 'Driver ID', 'Age', 'Policy Effective Date', 'Date of Birth', 'Marital Status', 'Primary Vehicle Use','Gender']]

df_driver_underage.head() \
    .style \
    .applymap(highlight_style, subset=['Age'])
Out[43]:
Policy Number Driver ID Age Policy Effective Date Date of Birth Marital Status Primary Vehicle Use Gender
11 1000011 3011 14 2011-05-02 00:00:00 1996-05-15 00:00:00 S Work M
15 1000014 2218629 14 2013-02-23 00:00:00 1998-04-05 00:00:00 S Work F
18 1000016 3016 14 2011-07-31 00:00:00 1996-08-05 00:00:00 S Work F
56 1000049 3049 12 2011-12-06 00:00:00 1999-05-17 00:00:00 S Work M
76 1000066 3066 12 2011-12-14 00:00:00 1999-07-02 00:00:00 S Work M
In [44]:
print(f'There are {df_driver_underage.shape[0]} drivers less than 15 years old.')
print(f"{df_driver_underage[df_driver_underage['Marital Status'] == 'M'].shape[0]} of them are married according to the data.")
There are 1647 drivers less than 15 years old.
373 of them are married according to the data.
In [45]:
fig, axs = plt.subplots(2, 2, figsize=(12, 10))

fig.suptitle('Drivers Under 15')
fig.tight_layout(pad=3.0)

sns.countplot(x='Age',
              data=df_driver_underage,
              color='#444444',
              ax=axs[0][0])

sns.countplot(x='Marital Status',
              data=df_driver_underage,
              color='#444444',
              ax=axs[0][1])

sns.countplot(x='Gender',
              data=df_driver_underage,
              color='#444444',
              ax=axs[1][0])

sns.countplot(x='Primary Vehicle Use',
              data=df_driver_underage,
              color='#444444',
              ax=axs[1][1])
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x256b47fa630>
Conclusion
There are 1647 drivers who are under the age of 15 (minimum age to apply for a provisional driver license in California). 373 of them have married status.

It is pretty clear that the birth dates for these rows are inaccurate.

Were there any vehicles that weren't produced on policy effective date?

Usually, vehicles for year N is released a year before (N - 1). For example, Corolla 2021 will be released in 2020. So, it is okay in some cases to have vehicle model year being larger than the policy effective year. However, it doesn't make sense if the vehicle model year is larger than the policy effective year by 2 or larger.

In [46]:
df_invalid_vehicle_model_years = df_risk_master[df_risk_master['Vehicle Model Year'] >= df_risk_master['Policy Effective Date'].dt.year + 2] \
    [['Policy Number', 'Vehicle ID', 'Vehicle Model Year', 'Policy Effective Date']] \

df_invalid_vehicle_model_years['Policy Effective Year'] = df_invalid_vehicle_model_years['Policy Effective Date'].dt.year

display(
    df_invalid_vehicle_model_years.head() \
        .style \
        .applymap(highlight_style, subset=['Vehicle Model Year', 'Policy Effective Year'])
)

print(f'There are {df_invalid_vehicle_model_years.shape[0]} rows where the vehicle model year is larger than the policy effective year by at least two.')
Policy Number Vehicle ID Vehicle Model Year Policy Effective Date Policy Effective Year
11 1000011 1 2013 2011-05-02 00:00:00 2011
18 1000016 87 2013 2011-07-31 00:00:00 2011
43 1000037 53 2013 2011-07-27 00:00:00 2011
54 1000047 39 2013 2011-12-19 00:00:00 2011
95 1000083 46 2012 2010-11-20 00:00:00 2010
There are 1302 rows where the vehicle model year is larger than the policy effective year by at least two.
Conclusion There are 1302 rows where the vehicle model year is larger than the policy effective year by at least two.

Examining/Visualizing factors that affect the Premium

Is using the total premium as the dependent variable appropriate?

Before starting to explore the factors that affect the premiums, let's first determine whether using the total premium is appropriate. To do that, I create box plots by number of violations on total premium and monthly premium.

In [47]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

fig.suptitle('Total vs Monthly premium distributions by # of violations')

sns.boxplot(x='Number of Violations',
            y='Total Premium',
            data=df_risk_master[df_risk_master['Net Premium'] > 0],
            palette='Blues',
            ax=ax1)

sns.boxplot(x='Number of Violations',
            y='Monthly Premium',
            data=df_risk_master[df_risk_master['Net Premium'] > 0],
            palette='Blues',
            ax=ax2)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x256ba6057b8>

The left plot does show a slight sign of positive linear relationship between the number of violations and total premium. However, the distributions on the left plot are skewed towards the right top since the total premium columns contain both semiannual/annual policies.

Conclusion Using the monthly premium as a dependent variable will give us a better fit with less skewness.

Monthly premium distribution by various categories

In [48]:
fig, axs = plt.subplots(4, 2, figsize=(16, 24))
fig.tight_layout(pad=3.0)

def display_premium_distribution(data, by, ax):
    return sns.boxplot(data=data,
                      x=by,
                      y='Monthly Premium', 
                      palette='Blues', 
                      ax=ax)

display_premium_distribution(data=df_risk_master, by='Gender', ax=axs[0][0])
display_premium_distribution(data=df_risk_master, by='Primary Vehicle Use', ax=axs[0][1])
display_premium_distribution(data=df_risk_master, by='Marital Status', ax=axs[1][0])
display_premium_distribution(data=df_risk_master, by='Duration', ax=axs[1][1])
display_premium_distribution(data=df_risk_master, by='Number of Violations', ax=axs[2][0])
display_premium_distribution(data=df_risk_master, by='Number of Accidents', ax=axs[2][1])
display_premium_distribution(data=df_risk_master, by='Vehicle Model Year', ax=axs[3][0])
display_premium_distribution(data=df_risk_master, by='Age Bin', ax=axs[3][1])
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x256b814b9b0>

The left plot does show a slight sign of positive linear relationship between the number of violations and total premium. However, the distributions on the left plot are skewed towards the right top since the total premium columns contain both semiannual/annual policies.

Analysis Number of violations, number of accidents, and vehicle model year have most visible (and linear) impact on the premium.

Premium pricing model using linear regression

Replace spaces in column names with underscores

In [49]:
df_risk_renamed = df_risk_master.copy() \
    .rename(columns={ col: col.strip().replace(' ', '_') for col in df_risk_master.columns })

df_risk_renamed.head(3)
Out[49]:
Record_ID Policy_Number Driver_ID Vehicle_ID Vehicle_Model_Year Location_ID Total_Premium Refund_Amount Net_Premium Policy_Effective_Date Policy_Expiration_Date Agency_ID Underwriting_Company_ID Cancel_Date Duration First_Name Last_Name Number_of_Violations Number_of_Accidents Marital_Status Gender Miles_to_work Primary_Vehicle_Use Date_of_Birth Age Age_Bin Monthly_Premium Claim_Count Claim_Amount Claim_Expense Make Model Vehicle_Average_Price Cancelled ZIP City State County Estimated_Population
0 11 1000001 3001 34 2011 65 4434.570900 NaN 4434.570900 2014-03-09 2015-03-09 11 4 NaT Annual Josefa Turnbill 6 2 S M 28.0 Leisure 1994-02-28 20 20-29 369.547575 NaN NaN NaN Dodge Challenger 26600 False 92630.0 Lake Forest CA Orange County 56565.0
1 22644 1000002 3002 61 2008 536 4795.820215 NaN 4795.820215 2011-04-02 2012-04-02 15 2 NaT Annual Deidre Whilden 6 2 S M 34.0 Work 1965-07-04 45 40-49 399.651685 NaN NaN NaN Audi A3 SportBack 38900 False 90222.0 Compton CA Los Angeles County 26204.0
2 20253 1000003 3003 61 2011 NaN 1325.814736 NaN 1325.814736 2012-01-29 2012-07-31 3 7 NaT Semiannual Fernande Lashbrook 2 0 S F 19.0 Leisure 1973-09-19 38 30-39 220.969123 NaN NaN NaN Audi A3 SportBack 38900 False NaN NaN NaN NaN NaN
In [50]:
linreg_formula = 'Monthly_Premium ~ Number_of_Violations + ' + \
    'Number_of_Accidents + Vehicle_Model_Year + ' \
    'Duration + ' \
    'C(Gender) + C(Age_Bin) + C(Marital_Status) + C(Primary_Vehicle_Use) + ' + \
    'C(Make) + C(Model)'
results = smf.ols(linreg_formula, data=df_risk_renamed).fit()
results.summary()
Out[50]:
OLS Regression Results
Dep. Variable: Monthly_Premium R-squared: 0.665
Model: OLS Adj. R-squared: 0.664
Method: Least Squares F-statistic: 614.8
Date: Mon, 18 May 2020 Prob (F-statistic): 0.00
Time: 04:43:17 Log-Likelihood: -1.6017e+05
No. Observations: 31115 AIC: 3.205e+05
Df Residuals: 31014 BIC: 3.214e+05
Df Model: 100
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.899e+04 176.492 164.273 0.000 2.86e+04 2.93e+04
Duration[T.Semiannual] -7.4524 0.474 -15.735 0.000 -8.381 -6.524
C(Gender)[T.M] 2.9616 0.492 6.024 0.000 1.998 3.925
C(Age_Bin)[T.20-29] -0.6068 0.698 -0.869 0.385 -1.975 0.762
C(Age_Bin)[T.30-39] -1.9681 0.771 -2.553 0.011 -3.479 -0.457
C(Age_Bin)[T.40-49] -4.5835 0.795 -5.766 0.000 -6.142 -3.025
C(Age_Bin)[T.50-59] -5.0514 0.980 -5.156 0.000 -6.972 -3.131
C(Age_Bin)[T.60-69] -1.9734 1.836 -1.075 0.282 -5.572 1.625
C(Age_Bin)[T.70+] 4.9663 5.818 0.854 0.393 -6.438 16.371
C(Marital_Status)[T.M] -5.4827 1.081 -5.073 0.000 -7.601 -3.365
C(Marital_Status)[T.S] 8.6139 1.017 8.466 0.000 6.620 10.608
C(Primary_Vehicle_Use)[T.Work] 0.1484 0.547 0.272 0.786 -0.923 1.220
C(Make)[T.Audi] 1133.9028 6.908 164.145 0.000 1120.363 1147.443
C(Make)[T.BMW] 1220.2117 7.404 164.805 0.000 1205.700 1234.724
C(Make)[T.Chevrolet] 1169.0011 7.257 161.083 0.000 1154.777 1183.225
C(Make)[T.Chrysler] 1413.8131 8.939 158.154 0.000 1396.291 1431.335
C(Make)[T.Dodge] 1121.2100 7.002 160.128 0.000 1107.486 1134.934
C(Make)[T.Ford] 1175.6929 7.235 162.495 0.000 1161.512 1189.874
C(Make)[T.Honda] 1174.2124 7.257 161.798 0.000 1159.988 1188.437
C(Make)[T.Hyundai] 939.7434 5.863 160.283 0.000 928.252 951.235
C(Make)[T.Infiniti] 1182.5268 7.192 164.432 0.000 1168.431 1196.623
C(Make)[T.Jaguar] 1136.6538 6.912 164.446 0.000 1123.106 1150.202
C(Make)[T.Jeep] 1204.1950 7.458 161.457 0.000 1189.576 1218.814
C(Make)[T.Kia] 1056.2901 6.540 161.507 0.000 1043.471 1069.109
C(Make)[T.Land Rover] 1135.4186 6.931 163.816 0.000 1121.833 1149.004
C(Make)[T.Lexus] 1077.2466 6.515 165.341 0.000 1064.476 1090.017
C(Make)[T.Mazda] 1169.3112 7.263 160.999 0.000 1155.076 1183.547
C(Make)[T.Mercedes-Benz] 1146.5698 6.939 165.245 0.000 1132.970 1160.170
C(Make)[T.Mitsubishi] 1124.5944 6.992 160.837 0.000 1110.890 1138.299
C(Make)[T.Nissan] 1121.3412 7.003 160.118 0.000 1107.615 1135.068
C(Make)[T.TOYOTA] 1207.5255 7.437 162.375 0.000 1192.949 1222.102
C(Model)[T.228i] 195.8245 2.326 84.182 0.000 191.265 200.384
C(Model)[T.300 Series] 0.3542 2.973 0.119 0.905 -5.472 6.181
C(Model)[T.328i] 192.2034 2.155 89.208 0.000 187.980 196.426
C(Model)[T.535i] 209.0868 2.047 102.126 0.000 205.074 213.100
C(Model)[T.650 Grand Coupe] 224.1986 2.212 101.340 0.000 219.862 228.535
C(Model)[T.A3 SportBack] 283.5480 2.394 118.429 0.000 278.855 288.241
C(Model)[T.A5 Coupe] 283.2878 2.430 116.569 0.000 278.524 288.051
C(Model)[T.Accord] 229.0976 3.366 68.053 0.000 222.499 235.696
C(Model)[T.Altima] 279.8566 3.386 82.660 0.000 273.221 286.493
C(Model)[T.Azera] 474.6752 3.319 143.001 0.000 468.169 481.181
C(Model)[T.C Class Sedan] 268.5989 2.405 111.689 0.000 263.885 273.313
C(Model)[T.C-Max Hybrid] 234.6247 2.495 94.044 0.000 229.735 239.515
C(Model)[T.CRV] 228.9633 3.231 70.863 0.000 222.630 235.296
C(Model)[T.CX-5] 234.2465 3.209 72.992 0.000 227.956 240.537
C(Model)[T.CX-9] 245.0830 2.496 98.189 0.000 240.191 249.975
C(Model)[T.Camry] 202.0099 2.356 85.755 0.000 197.393 206.627
C(Model)[T.Challenger] 292.7529 2.872 101.944 0.000 287.124 298.382
C(Model)[T.Cherokee] 205.1956 2.540 80.801 0.000 200.218 210.173
C(Model)[T.Civic] 237.5560 2.615 90.839 0.000 232.430 242.682
C(Model)[T.Compass] 191.3977 3.208 59.656 0.000 185.109 197.686
C(Model)[T.Corolla iM] 194.5576 3.075 63.264 0.000 188.530 200.585
C(Model)[T.Dart] 279.7143 3.332 83.944 0.000 273.183 286.245
C(Model)[T.E Class Sedan] 274.0815 2.338 117.240 0.000 269.499 278.664
C(Model)[T.EVOQUE] 266.2895 3.087 86.260 0.000 260.239 272.340
C(Model)[T.Elantra] 465.0682 3.639 127.815 0.000 457.936 472.200
C(Model)[T.Expedition] 242.8773 2.331 104.179 0.000 238.308 247.447
C(Model)[T.F-PACE] 277.2292 2.371 116.947 0.000 272.583 281.876
C(Model)[T.Flex] 236.8966 2.409 98.341 0.000 232.175 241.618
C(Model)[T.Focus SE] 228.0113 3.258 69.984 0.000 221.625 234.397
C(Model)[T.Fusion] 233.2830 2.572 90.710 0.000 228.242 238.324
C(Model)[T.GS] 359.7494 2.752 130.736 0.000 354.356 365.143
C(Model)[T.Grand Caravan] 272.4711 3.257 83.665 0.000 266.088 278.854
C(Model)[T.Grand Cherokee] 212.9545 2.395 88.934 0.000 208.261 217.648
C(Model)[T.ILX] 1411.4489 8.903 158.533 0.000 1393.998 1428.899
C(Model)[T.Impala] 242.8855 2.577 94.258 0.000 237.835 247.936
C(Model)[T.Journey] 276.2718 3.427 80.615 0.000 269.555 282.989
C(Model)[T.Kia Forte] 343.4661 3.316 103.563 0.000 336.966 349.967
C(Model)[T.Kia Sorento] 357.1244 2.777 128.605 0.000 351.682 362.567
C(Model)[T.Kia Sportage] 355.6996 2.877 123.644 0.000 350.061 361.338
C(Model)[T.LR4] 286.7856 2.278 125.905 0.000 282.321 291.250
C(Model)[T.LS] 354.5457 2.607 136.012 0.000 349.436 359.655
C(Model)[T.LX] 362.9515 2.749 132.017 0.000 357.563 368.340
C(Model)[T.Lancer] 278.8535 3.257 85.611 0.000 272.469 285.238
C(Model)[T.MDX] 1416.7493 8.758 161.758 0.000 1399.582 1433.916
C(Model)[T.Malibu] 228.6326 3.197 71.520 0.000 222.367 234.898
C(Model)[T.Maxima] 293.3649 2.574 113.991 0.000 288.321 298.409
C(Model)[T.Maybatch] 307.2814 2.497 123.038 0.000 302.386 312.177
C(Model)[T.Mazda3] 225.8430 3.215 70.254 0.000 219.542 232.144
C(Model)[T.Mazda5] 231.0737 2.590 89.231 0.000 225.998 236.149
C(Model)[T.Mazda6] 233.0651 3.175 73.416 0.000 226.843 239.287
C(Model)[T.MiEV] 282.2346 3.303 85.452 0.000 275.761 288.708
C(Model)[T.Mirage] 273.8718 3.266 83.851 0.000 267.470 280.274
C(Model)[T.New Discovery] 283.6984 2.477 114.522 0.000 278.843 288.554
C(Model)[T.Odyssey] 237.0538 2.709 87.493 0.000 231.743 242.364
C(Model)[T.OutLander] 289.6344 2.776 104.334 0.000 284.193 295.076
C(Model)[T.Patriot] 193.0886 3.132 61.657 0.000 186.950 199.227
C(Model)[T.Pilot] 241.5418 2.621 92.151 0.000 236.404 246.679
C(Model)[T.Prius] 201.3769 2.161 93.179 0.000 197.141 205.613
C(Model)[T.Q5] 282.5620 2.386 118.428 0.000 277.885 287.239
C(Model)[T.Q50] 234.4255 2.224 105.427 0.000 230.067 238.784
C(Model)[T.Q60] 236.4569 2.288 103.355 0.000 231.973 240.941
C(Model)[T.Q70] 236.7477 2.292 103.292 0.000 232.255 241.240
C(Model)[T.QX60] 240.5685 2.249 106.988 0.000 236.161 244.976
C(Model)[T.QX70] 234.3283 2.274 103.060 0.000 229.872 238.785
C(Model)[T.RAV4] 207.1610 2.235 92.691 0.000 202.780 211.542
C(Model)[T.RDX] 1416.4555 8.789 161.157 0.000 1399.228 1433.683
C(Model)[T.RLX] 1421.6954 8.755 162.388 0.000 1404.535 1438.855
C(Model)[T.Range Rover] 298.6451 2.348 127.178 0.000 294.042 303.248
C(Model)[T.Renegare] 194.0677 3.129 62.022 0.000 187.935 200.201
C(Model)[T.S Class Sedan] 296.6080 2.441 121.514 0.000 291.824 301.392
C(Model)[T.S3] 284.5051 2.418 117.682 0.000 279.767 289.244
C(Model)[T.Sentra] 274.5372 3.344 82.102 0.000 267.983 281.091
C(Model)[T.Sienna] 210.0951 2.223 94.499 0.000 205.737 214.453
C(Model)[T.Sonic] 227.1643 3.220 70.538 0.000 220.852 233.477
C(Model)[T.Spark] 226.2437 3.159 71.621 0.000 220.052 232.435
C(Model)[T.TLX] 1416.9572 8.780 161.390 0.000 1399.749 1434.166
C(Model)[T.Town and County] -3.0039 3.090 -0.972 0.331 -9.060 3.052
C(Model)[T.Versa] 273.5826 3.332 82.102 0.000 267.051 280.114
C(Model)[T.Volt] 244.0750 2.350 103.850 0.000 239.468 248.682
C(Model)[T.Wrangler] 207.4908 2.469 84.025 0.000 202.651 212.331
C(Model)[T.X3] 194.8729 2.105 92.563 0.000 190.746 198.999
C(Model)[T.X5] 204.0254 2.045 99.774 0.000 200.017 208.033
C(Model)[T.XE] 279.2129 2.290 121.929 0.000 274.725 283.701
C(Model)[T.XF] 282.9051 2.360 119.865 0.000 278.279 287.531
C(Model)[T.XJ] 297.3066 2.317 128.313 0.000 292.765 301.848
C(Model)[T.Yaris] 192.3251 3.166 60.740 0.000 186.119 198.531
Number_of_Violations 15.1240 0.136 110.893 0.000 14.857 15.391
Number_of_Accidents 30.5705 0.299 102.326 0.000 29.985 31.156
Vehicle_Model_Year -15.0456 0.092 -163.303 0.000 -15.226 -14.865
Omnibus: 777.273 Durbin-Watson: 1.729
Prob(Omnibus): 0.000 Jarque-Bera (JB): 395.716
Skew: 0.036 Prob(JB): 1.18e-86
Kurtosis: 2.452 Cond. No. 1.33e+18


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.11e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Preliminary Loss Ratio Analysis

Here, I look at the loss ratios grouped by each dependent variable.

Overall Loss Ratio

In [51]:
overall_combined_ratio_with_expenses = (df_risk_master['Claim Amount'].sum() + df_risk_master['Claim Expense'].sum()) / df_risk_master['Net Premium'].sum()
print(f'Overall combined ratio without claim expenses is {overall_combined_ratio_with_expenses * 100:.4f}%.')
Overall combined ratio without claim expenses is 21.1209%.
In [52]:
overall_loss_ratio_without_expenses = df_risk_master['Claim Amount'].sum() / df_risk_master['Net Premium'].sum()
print(f'Overall loss ratio with claim expenses is {overall_loss_ratio_without_expenses * 100:.4f}%.')
Overall loss ratio with claim expenses is 19.2286%.

Utility functions to calculate/visualize loss ratios by different criteria

A few functions are defined here to calculate and visualize performance metrics.

In [53]:
def get_performance_metrics(df_risk_merged, by):
    '''
    A function to calculate the loss ratios grouped by provided column names
    
    Parameters:
        df_risk_merged (DataFrame): A merged/cleaned DataFrame of Risk Fact, 
                                    Policy Dimension, Driver Dimension, Claims Fact, 
                                    Claim Expense Dimension
                                    The Dataframe should have one row per driver
        by (List<str>): A list of column names to group by
        
    Returns:
    DataFrame: A dataframe containing performance metrics (Claim Expense Ratio, Average Monthly Premium, Average Claim Amount, Loss Ratio)
    '''
    
    df_g = df_risk_merged.groupby(by=by).agg({
        'Net Premium': 'sum',
        'Claim Amount': 'sum',
        'Claim Count': 'sum',
        'Claim Expense': 'sum',
        'Monthly Premium': 'sum', 
        'Record ID': 'size'
    }).reset_index()
    
    # Record ID column was used to count the number of drivers within each group
    df_g.rename(columns={ 
        'Record ID': 'Number of Drivers',
        'Claim Amount': 'Total Claim Amount', 
        'Claim Expense': 'Total Claim Expense',
        'Net Premium': 'Total Premium'
    }, inplace=True)
    
    # Calculate loss/combined ratios
    df_g['Loss Ratio'] = df_g['Total Claim Amount'] / df_g['Total Premium']
    df_g['Combined Ratio'] = (df_g['Total Claim Amount'] + df_g['Total Claim Expense']) / df_g['Total Premium']
    
    # Sort result by combined ratio in descending order
    df_g.sort_values('Combined Ratio', ascending=False, inplace=True)
    
    # Calculate other metrics
    df_g['Driver %'] = df_g['Number of Drivers'] / df_risk_merged.shape[0]
    df_g['Average Monthly Premium'] = df_g['Monthly Premium'] / df_g['Number of Drivers']
    df_g['Claim %'] = df_g['Claim Count'] / df_g['Number of Drivers']
    df_g['Average Claim Amount'] = df_g['Total Claim Amount'] / df_g['Claim Count']
    df_g['Average Claim Expense'] = df_g['Total Claim Expense'] / df_g['Claim Count']
    
    # Reorder columns for readability
    column_order = by if type(by) == list else [by]
    column_order.extend(['Number of Drivers',
                         'Driver %',
                         'Total Premium',
                         'Monthly Premium',
                         'Claim Count',
                         'Claim %',
                         'Total Claim Amount',
                         'Total Claim Expense',
                         'Average Monthly Premium',
                         'Average Claim Amount',
                         'Average Claim Expense',
                         'Loss Ratio',
                         'Combined Ratio'])
    
    df_g = df_g[column_order]
    
    return df_g
In [54]:
def display_performance_metrics(df):
    '''
    Display performance metrics with background bar plots
    
    Parameters:
    df (DataFrame): A dataframe containing the performance metrics
    
    Returns:
    DataFrame: A styled dataframe with bar plots as a background
    '''
    # Create a cloned instance of the DataFrame to avoid side-effects
    df = df.copy()
    
    # Columns that are hidden (not displayed)
    hidden_columns = ['Monthly Premium']
    hidden_columns_matched = [col for col in hidden_columns if col in df.columns]
    
    # Drop hidden columns
    df = df.drop(columns=hidden_columns_matched)
    
    # Metric columns display order
    metric_columns_order = ['Number of Drivers',
                            'Driver %',
                            'Total Premium',
                            'Average Monthly Premium',
                            'Claim Count',
                            'Claim %',
                            'Total Claim Amount',
                            'Total Claim Expense',
                            'Average Claim Amount',
                            'Average Claim Expense',
                            'Loss Ratio',
                            'Combined Ratio']
    
    # Only include the metric columns that exist
    metric_columns_order = [col for col in metric_columns_order if col in df.columns]
    
    # Non-metric columns order display order
    non_metric_columns_order = df.columns.to_series().copy() \
        .drop(metric_columns_order)
    
    # Display non-metric columns first, follwed by metric columns
    display_column_order = non_metric_columns_order.append(pd.Series(metric_columns_order))
    
    df = df[display_column_order]
    
    return df.copy() \
            .style \
            .format({
                'Driver %': '{:.1%}',
                'Total Premium': '${:,.0f}',
                'Average Monthly Premium': '${:,.0f}',
                'Claim %': '{:.1%}',
                'Total Claim Amount': '${:,.0f}',
                'Total Claim Expense': '${:,.0f}',
                'Average Claim Amount': '${:,.0f}',
                'Average Claim Expense': '${:,.0f}',
                'Loss Ratio': '{:.1%}',
                'Combined Ratio': '{:.1%}',
            }) \
            .bar(subset=['Loss Ratio'],
                 color='#FFEBEB',
                 vmin=0,
                 vmax=min(1, df['Loss Ratio'].max() * 1.2)) \
            .bar(subset=['Combined Ratio'],
                 color='#FFDFDF',
                 vmin=0,
                 vmax=min(1, df['Combined Ratio'].max() * 1.2)) \
            .applymap(lambda x: 'color: #BB0000; background-color: #fff; border-bottom: 1px solid white;', 
                      subset=['Loss Ratio']) \
            .applymap(lambda x: 'color: #D42323; background-color: #fff; font-weight: bold; border-bottom: 1px solid #fff;', 
                      subset=['Combined Ratio'])

Loss Ratio by Marital Status

In [55]:
df_loss_ratio_by_marital_status = get_performance_metrics(df_risk_master, 'Marital Status')
display_performance_metrics(df_loss_ratio_by_marital_status)
Out[55]:
Marital Status Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
0 D 2124 5.9% $4,326,631 $231 80 3.8% $1,196,362 $102,782 $14,955 $1,285 27.7% 30.0%
2 S 24423 68.4% $51,755,556 $240 952 3.9% $9,865,118 $980,630 $10,363 $1,030 19.1% 21.0%
1 M 9165 25.7% $18,401,693 $227 322 3.5% $3,260,719 $326,019 $10,126 $1,012 17.7% 19.5%

Loss Ratio by Gender

In [56]:
df_loss_ratio_by_gender = get_performance_metrics(df_risk_master, 'Gender')
display_performance_metrics(df_loss_ratio_by_gender)
Out[56]:
Gender Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
1 M 22653 63.4% $47,270,615 $237 876 3.9% $9,204,962 $918,137 $10,508 $1,048 19.5% 21.4%
0 F 13059 36.6% $27,213,265 $234 478 3.7% $5,117,237 $491,294 $10,706 $1,028 18.8% 20.6%

Loss Ratio by Primary Vehicle Use

In [57]:
df_loss_ratio_by_vehicle_use = get_performance_metrics(df_risk_master, 'Primary Vehicle Use')
display_performance_metrics(df_loss_ratio_by_vehicle_use)
Out[57]:
Primary Vehicle Use Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
0 Leisure 8939 25.0% $18,666,746 $236 368 4.1% $4,039,867 $401,860 $10,978 $1,092 21.6% 23.8%
1 Work 26773 75.0% $55,817,135 $236 986 3.7% $10,282,332 $1,007,570 $10,428 $1,022 18.4% 20.2%

Loss Ratio by Marital Status, Gender, Primary Vehicle Use Combined

This is an example to calculate the loss ratios grouped by multiple columns.

In [58]:
df_loss_ratio_by_marital_gender_use = get_performance_metrics(df_risk_master, ['Marital Status', 'Gender', 'Primary Vehicle Use'])
display_performance_metrics(df_loss_ratio_by_marital_gender_use)
Out[58]:
Marital Status Gender Primary Vehicle Use Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
2 D M Leisure 341 1.0% $732,722 $236 12 3.5% $250,920 $22,769 $20,910 $1,897 34.2% 37.4%
1 D F Work 567 1.6% $1,149,590 $231 23 4.1% $368,304 $31,979 $16,013 $1,390 32.0% 34.8%
6 M M Leisure 1414 4.0% $2,834,357 $227 49 3.5% $716,494 $78,763 $14,622 $1,607 25.3% 28.1%
3 D M Work 1001 2.8% $2,026,899 $231 37 3.7% $508,464 $39,944 $13,742 $1,080 25.1% 27.1%
10 S M Leisure 3880 10.9% $8,247,040 $241 166 4.3% $1,713,918 $167,157 $10,325 $1,007 20.8% 22.8%
8 S F Leisure 2242 6.3% $4,712,438 $239 101 4.5% $969,911 $98,659 $9,603 $977 20.6% 22.7%
11 S M Work 11681 32.7% $24,724,068 $240 456 3.9% $4,629,279 $462,817 $10,152 $1,015 18.7% 20.6%
4 M F Leisure 847 2.4% $1,722,769 $227 32 3.8% $319,950 $26,423 $9,998 $826 18.6% 20.1%
9 S F Work 6620 18.5% $14,072,011 $238 229 3.5% $2,552,010 $251,997 $11,144 $1,100 18.1% 19.9%
0 D F Leisure 215 0.6% $417,420 $226 8 3.7% $68,674 $8,090 $8,584 $1,011 16.5% 18.4%
5 M F Work 2568 7.2% $5,139,037 $225 85 3.3% $838,388 $74,146 $9,863 $872 16.3% 17.8%
7 M M Work 4336 12.1% $8,705,530 $227 156 3.6% $1,385,887 $146,688 $8,884 $940 15.9% 17.6%

Loss Ratio by Policy Duration

In [59]:
df_loss_ratio_by_duration = get_performance_metrics(df_risk_master, 'Duration')
display_performance_metrics(df_loss_ratio_by_duration)
Out[59]:
Duration Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
1 Semiannual 17368 48.6% $22,906,968 $232 659 3.8% $6,810,830 $662,908 $10,335 $1,006 29.7% 32.6%
0 Annual 18344 51.4% $51,576,912 $239 695 3.8% $7,511,369 $746,523 $10,808 $1,074 14.6% 16.0%
Analysis There is a huge difference between the loss ratios between annual and biannual policies. This is likely to be an error that was introduced while creating artifical data. Therefore, no further analysis is done for the loss ratios by duration.

Loss Ratio by Underwriting Company

In [60]:
df_loss_ratio_by_underwriter = get_performance_metrics(df_risk_master, 'Underwriting Company ID')
df_loss_ratio_by_underwriter = df_loss_ratio_by_underwriter \
    .merge(right=df_company_dimension[['Underwriting Company ID', 'Underwriting Agency Name', 'Agency City']], on='Underwriting Company ID')
display_performance_metrics(df_loss_ratio_by_underwriter)
Out[60]:
Underwriting Company ID Underwriting Agency Name Agency City Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
0 5 Mercury CA Insurance Cerritos 5066 14.2% $10,097,886 $227 193 3.8% $2,184,066 $206,501 $11,316 $1,070 21.6% 23.7%
1 7 AIS-Auto Insurance Agency Walnut Creek 5049 14.1% $11,379,012 $256 211 4.2% $2,339,527 $233,366 $11,088 $1,106 20.6% 22.6%
2 4 Califrnial Automobile Insurance Brea 5219 14.6% $10,961,985 $236 198 3.8% $2,162,790 $234,559 $10,923 $1,185 19.7% 21.9%
3 6 Clovis Insurance Agency Clovis 5064 14.2% $10,848,534 $241 205 4.0% $2,045,773 $188,210 $9,979 $918 18.9% 20.6%
4 3 Reliable Write Company Huntington Park 5127 14.4% $10,289,273 $227 192 3.7% $1,881,535 $201,036 $9,800 $1,047 18.3% 20.2%
5 1 Kemper Specility California Los Angeles 4985 14.0% $9,853,289 $223 158 3.2% $1,774,960 $157,042 $11,234 $994 18.0% 19.6%
6 2 Horrace Mann Company Baldwin Park 5202 14.6% $11,053,901 $240 197 3.8% $1,933,548 $188,717 $9,815 $958 17.5% 19.2%

Loss Ratio by Agency

In [61]:
df_loss_ratio_by_agency = get_performance_metrics(df_risk_master, 'Agency ID')
df_loss_ratio_by_agency = df_loss_ratio_by_agency.merge(right=df_agency_dimension[['Agency ID', 'Agency Name']], on='Agency ID')
display_performance_metrics(df_loss_ratio_by_agency)
Out[61]:
Agency ID Agency Name Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
0 11 Yugine C Sport Insurance 1748 4.9% $3,740,161 $245 92 5.3% $1,349,610 $147,491 $14,670 $1,603 36.1% 40.0%
1 16 Garys Insurance Agency 1783 5.0% $3,431,563 $218 87 4.9% $905,693 $79,744 $10,410 $917 26.4% 28.7%
2 8 Kostyo Insurance Agency 1918 5.4% $3,672,486 $218 75 3.9% $924,146 $102,387 $12,322 $1,365 25.2% 28.0%
3 4 Califrnial Automobile Insurance 1755 4.9% $3,909,470 $250 66 3.8% $925,574 $92,758 $14,024 $1,405 23.7% 26.0%
4 19 Clean Energy Car Insurance 1802 5.0% $3,678,116 $228 58 3.2% $848,119 $80,761 $14,623 $1,392 23.1% 25.3%
5 5 Mercury CA Insurance 1790 5.0% $3,857,763 $245 66 3.7% $832,650 $76,063 $12,616 $1,152 21.6% 23.6%
6 15 OneStop Car Insurance 1779 5.0% $3,982,061 $249 66 3.7% $851,280 $72,114 $12,898 $1,093 21.4% 23.2%
7 3 Reliable Insurance 1888 5.3% $4,186,657 $254 86 4.6% $805,767 $90,118 $9,369 $1,048 19.2% 21.4%
8 10 Suncan Insurance Agency 1881 5.3% $3,797,221 $232 77 4.1% $718,254 $72,893 $9,328 $947 18.9% 20.8%
9 20 USAVE Car Insurance Agency 1848 5.2% $3,698,694 $227 71 3.8% $686,522 $64,882 $9,669 $914 18.6% 20.3%
10 1 National Insurance 1810 5.1% $3,946,293 $243 67 3.7% $727,365 $68,530 $10,856 $1,023 18.4% 20.2%
11 12 Alandale Car Insurance 1725 4.8% $3,247,470 $211 52 3.0% $562,090 $47,177 $10,809 $907 17.3% 18.8%
12 17 Mr. Right Car Insurance 1742 4.9% $3,619,403 $231 65 3.7% $603,069 $59,885 $9,278 $921 16.7% 18.3%
13 13 Rose Insurance 1784 5.0% $3,735,112 $242 60 3.4% $593,791 $63,040 $9,897 $1,051 15.9% 17.6%
14 18 Bay Area Car Insuarnce 1724 4.8% $3,666,456 $241 69 4.0% $585,720 $58,466 $8,489 $847 16.0% 17.6%
15 9 Pregil Insurance Agency 1779 5.0% $3,824,107 $240 62 3.5% $596,082 $64,901 $9,614 $1,047 15.6% 17.3%
16 7 AIS-Auto Insurance Agency 1768 5.0% $3,962,506 $251 70 4.0% $556,884 $48,274 $7,955 $690 14.1% 15.3%
17 14 PickWise Car Insuarance 1722 4.8% $3,331,103 $219 65 3.8% $453,147 $40,090 $6,971 $617 13.6% 14.8%
18 2 Auto AA Insurance 1764 4.9% $3,757,411 $243 57 3.2% $461,041 $53,716 $8,088 $942 12.3% 13.7%
19 6 Clovis Insurance Agency 1702 4.8% $3,439,828 $229 43 2.5% $335,395 $26,139 $7,800 $608 9.8% 10.5%

Loss Ratio by Age Bin

In [62]:
df_loss_ratio_by_age_bin = get_performance_metrics(df_risk_master, 'Age Bin')
display_performance_metrics(df_loss_ratio_by_age_bin)
Out[62]:
Age Bin Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
5 60-69 579 1.6% $1,088,386 $211 29 5.0% $265,370 $23,560 $9,151 $812 24.4% 26.5%
4 50-59 2853 8.0% $5,445,670 $217 106 3.7% $1,173,675 $119,093 $11,072 $1,124 21.6% 23.7%
3 40-49 5989 16.8% $12,129,623 $230 240 4.0% $2,343,507 $226,273 $9,765 $943 19.3% 21.2%
2 30-39 6240 17.5% $13,284,798 $239 233 3.7% $2,539,279 $248,060 $10,898 $1,065 19.1% 21.0%
0 10-19 5893 16.5% $12,392,508 $238 231 3.9% $2,298,854 $246,939 $9,952 $1,069 18.6% 20.5%
1 20-29 9509 26.6% $20,454,376 $244 346 3.6% $3,749,346 $345,090 $10,836 $997 18.3% 20.0%
6 70+ 52 0.1% $103,291 $220 0 0.0% $0 $0 $nan $nan 0.0% 0.0%

Loss Ratio by Vehicle Make

In [63]:
df_loss_ratio_by_vehicle_make = get_performance_metrics(df_risk_master, 'Make')
display_performance_metrics(df_loss_ratio_by_vehicle_make)
Out[63]:
Make Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
14 Lexus 1580 4.4% $3,529,279 $254 63 4.0% $1,283,240 $124,782 $20,369 $1,981 36.4% 39.9%
16 Mercedes-Benz 2109 5.9% $4,616,940 $250 74 3.5% $1,566,740 $151,341 $21,172 $2,045 33.9% 37.2%
2 BMW 3271 9.2% $7,035,813 $243 133 4.1% $2,167,830 $195,225 $16,299 $1,468 30.8% 33.6%
10 Jaguar 2468 6.9% $5,278,412 $241 98 4.0% $1,440,020 $136,806 $14,694 $1,396 27.3% 29.9%
13 Land Rover 2186 6.1% $4,683,183 $244 94 4.3% $1,225,490 $116,781 $13,037 $1,242 26.2% 28.7%
15 Mazda 1410 3.9% $2,747,581 $224 72 5.1% $536,460 $53,778 $7,451 $747 19.5% 21.5%
7 Honda 1408 3.9% $2,847,332 $228 65 4.6% $540,930 $57,998 $8,322 $892 19.0% 21.0%
3 Chevrolet 1519 4.3% $3,028,629 $225 69 4.5% $539,320 $50,695 $7,816 $735 17.8% 19.5%
8 Hyundai 728 2.0% $1,496,479 $234 35 4.8% $250,920 $29,272 $7,169 $836 16.8% 18.7%
5 Dodge 882 2.5% $1,758,542 $226 51 5.8% $288,956 $31,357 $5,666 $615 16.4% 18.2%
0 Acura 2715 7.6% $5,710,596 $239 94 3.5% $870,730 $91,688 $9,263 $975 15.2% 16.9%
19 TOYOTA 2467 6.9% $5,008,673 $230 95 3.9% $756,460 $75,746 $7,963 $797 15.1% 16.6%
4 Chrysler 1386 3.9% $2,807,101 $227 52 3.8% $380,240 $42,098 $7,312 $810 13.5% 15.0%
11 Jeep 1747 4.9% $3,503,352 $226 80 4.6% $460,000 $43,998 $5,750 $550 13.1% 14.4%
9 Infiniti 2627 7.4% $5,571,364 $238 67 2.6% $681,903 $64,688 $10,178 $965 12.2% 13.4%
17 Mitsubishi 940 2.6% $1,892,308 $226 46 4.9% $225,440 $20,947 $4,901 $455 11.9% 13.0%
18 Nissan 1095 3.1% $2,197,748 $225 54 4.9% $224,200 $23,523 $4,152 $436 10.2% 11.3%
6 Ford 1919 5.4% $3,919,551 $229 56 2.9% $376,000 $44,124 $6,714 $788 9.6% 10.7%
1 Audi 2147 6.0% $4,567,834 $239 35 1.6% $342,370 $40,292 $9,782 $1,151 7.5% 8.4%
12 Kia 1108 3.1% $2,283,163 $231 21 1.9% $164,950 $14,292 $7,855 $681 7.2% 7.9%

Loss Ratio by Vehicle ID (Model)

In [64]:
df_loss_ratio_by_vehicle_id = get_performance_metrics(df_risk_master, 'Vehicle ID')

# Add "Make Model" column
df_loss_ratio_by_vehicle_id = df_loss_ratio_by_vehicle_id.merge(df_vehicles_dimension[['Vehicle ID', 'Car Make', 'Model']], on='Vehicle ID', how='left')
df_loss_ratio_by_vehicle_id['Make Model'] = df_loss_ratio_by_vehicle_id['Car Make'] + ' ' + df_loss_ratio_by_vehicle_id['Model']

# Drop redundant columns
df_loss_ratio_by_vehicle_id.drop(columns=['Car Make', 'Model'], inplace=True)

display_performance_metrics(df_loss_ratio_by_vehicle_id)
Out[64]:
Vehicle ID Make Model Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
0 85 BMW 650 Grand Coupe 490 1.4% $1,111,664 $262 28 5.7% $769,120 $53,259 $27,469 $1,902 69.2% 74.0%
1 87 Mercedes-Benz Maybatch 461 1.3% $1,088,527 $268 20 4.3% $620,840 $48,223 $31,042 $2,411 57.0% 61.5%
2 82 Jaguar XJ 626 1.8% $1,407,289 $256 30 4.8% $781,860 $66,118 $26,062 $2,204 55.6% 60.3%
3 86 Mercedes-Benz S Class Sedan 525 1.5% $1,223,834 $256 24 4.6% $603,480 $75,773 $25,145 $3,157 49.3% 55.5%
4 84 Lexus LX 472 1.3% $1,062,072 $254 23 4.9% $502,850 $53,818 $21,863 $2,340 47.3% 52.4%
5 83 Lexus GS 467 1.3% $1,052,998 $256 24 5.1% $463,140 $44,803 $19,298 $1,867 44.0% 48.2%
6 79 BMW 535i 625 1.8% $1,377,464 $251 30 4.8% $590,850 $59,274 $19,695 $1,976 42.9% 47.2%
7 75 Land Rover LR4 828 2.3% $1,790,627 $243 42 5.1% $661,500 $64,860 $15,750 $1,544 36.9% 40.6%
8 26 Dodge Grand Caravan 194 0.5% $362,287 $215 17 8.8% $125,810 $15,668 $7,401 $922 34.7% 39.1%
9 8 Hyundai Elantra 187 0.5% $369,209 $226 15 8.0% $118,320 $15,431 $7,888 $1,029 32.0% 36.2%
10 44 Mazda CX-9 446 1.2% $846,493 $228 13 2.9% $253,170 $26,190 $19,475 $2,015 29.9% 33.0%
11 23 Honda Accord 171 0.5% $351,903 $227 17 9.9% $91,520 $11,224 $5,384 $660 26.0% 29.2%
12 33 Honda Civic 359 1.0% $745,874 $230 19 5.3% $195,000 $18,178 $10,263 $957 26.1% 28.6%
13 73 TOYOTA Sienna 521 1.5% $1,068,074 $237 20 3.8% $268,840 $31,626 $13,442 $1,581 25.2% 28.1%
14 17 Land Rover EVOQUE 211 0.6% $403,523 $223 15 7.1% $101,850 $10,754 $6,790 $717 25.2% 27.9%
15 77 BMW X5 632 1.8% $1,378,841 $249 25 4.0% $344,850 $36,658 $13,794 $1,466 25.0% 27.7%
16 55 Jaguar XE 751 2.1% $1,523,954 $232 38 5.1% $367,540 $41,075 $9,672 $1,081 24.1% 26.8%
17 59 BMW X3 566 1.6% $1,191,949 $233 24 4.2% $284,620 $27,664 $11,859 $1,153 23.9% 26.2%
18 4 Mazda Mazda3 192 0.5% $373,370 $217 17 8.9% $87,360 $9,461 $5,139 $557 23.4% 25.9%
19 46 Chrysler Town and County 456 1.3% $918,173 $221 25 5.5% $210,000 $24,309 $8,400 $972 22.9% 25.5%
20 76 Mercedes-Benz E Class Sedan 598 1.7% $1,240,194 $243 22 3.7% $294,840 $20,379 $13,402 $926 23.8% 25.4%
21 80 Lexus LS 641 1.8% $1,414,209 $253 16 2.5% $317,250 $26,160 $19,828 $1,635 22.4% 24.3%
22 10 Jeep Patriot 192 0.5% $406,832 $232 15 7.8% $88,230 $8,670 $5,882 $578 21.7% 23.8%
23 25 Nissan Altima 187 0.5% $378,833 $221 13 7.0% $81,620 $7,238 $6,278 $557 21.5% 23.5%
24 15 Dodge Journey 159 0.4% $321,341 $228 16 10.1% $70,200 $4,892 $4,388 $306 21.8% 23.4%
25 81 Land Rover Range Rover 647 1.8% $1,436,975 $258 23 3.6% $301,600 $23,664 $13,113 $1,029 21.0% 22.6%
26 51 Chevrolet Volt 558 1.6% $1,140,595 $229 24 4.3% $235,290 $22,080 $9,804 $920 20.6% 22.6%
27 3 Chevrolet Sonic 192 0.5% $387,531 $226 13 6.8% $80,550 $6,177 $6,196 $475 20.8% 22.4%
28 78 Acura RLX 667 1.9% $1,425,483 $243 25 3.7% $278,480 $36,297 $11,139 $1,452 19.5% 22.1%
29 37 Honda Odyssey 335 0.9% $668,482 $226 14 4.2% $132,720 $11,124 $9,480 $795 19.9% 21.5%
30 61 Audi A3 SportBack 555 1.6% $1,142,499 $238 16 2.9% $221,730 $24,064 $13,858 $1,504 19.4% 21.5%
31 43 Jeep Grand Cherokee 437 1.2% $871,721 $232 24 5.5% $169,650 $13,209 $7,069 $550 19.5% 21.0%
32 1 Nissan Versa 182 0.5% $334,693 $208 15 8.2% $62,260 $7,536 $4,151 $502 18.6% 20.9%
33 65 Acura MDX 551 1.5% $1,174,114 $239 23 4.2% $213,710 $17,312 $9,292 $753 18.2% 19.7%
34 14 TOYOTA Corolla iM 200 0.6% $413,887 $227 15 7.5% $69,730 $6,823 $4,649 $455 16.8% 18.5%
35 45 TOYOTA Camry 435 1.2% $878,385 $226 20 4.6% $145,870 $14,213 $7,294 $711 16.6% 18.2%
36 9 Jeep Renegare 198 0.6% $386,620 $223 15 7.6% $61,030 $8,439 $4,069 $563 15.8% 18.0%
37 2 Chevrolet Spark 199 0.6% $369,539 $218 13 6.5% $58,630 $6,104 $4,510 $470 15.9% 17.5%
38 13 Jeep Compass 175 0.5% $351,779 $224 8 4.6% $54,150 $7,083 $6,769 $885 15.4% 17.4%
39 21 Chevrolet Malibu 189 0.5% $357,872 $220 7 3.7% $58,740 $2,343 $8,391 $335 16.4% 17.1%
40 71 Land Rover New Discovery 500 1.4% $1,052,058 $237 14 2.8% $160,540 $17,503 $11,467 $1,250 15.3% 16.9%
41 30 Chrysler 200 Series 369 1.0% $723,966 $224 18 4.9% $110,400 $11,194 $6,133 $622 15.2% 16.8%
42 70 Infiniti QX70 521 1.5% $1,093,942 $236 19 3.6% $162,000 $21,141 $8,526 $1,113 14.8% 16.7%
43 48 Acura TLX 551 1.5% $1,175,869 $242 17 3.1% $173,440 $17,101 $10,202 $1,006 14.7% 16.2%
44 39 Mitsubishi OutLander 383 1.1% $776,289 $226 17 4.4% $111,720 $11,659 $6,572 $686 14.4% 15.9%
45 24 Mitsubishi MiEV 183 0.5% $369,696 $229 13 7.1% $55,660 $2,675 $4,282 $206 15.1% 15.8%
46 74 Infiniti Q70 509 1.4% $1,063,322 $236 12 2.4% $154,840 $10,800 $12,903 $900 14.6% 15.6%
47 35 Chevrolet Impala 381 1.1% $773,093 $226 12 3.1% $106,110 $13,991 $8,842 $1,166 13.7% 15.5%
48 38 Honda Pilot 355 1.0% $727,208 $232 10 2.8% $98,000 $14,778 $9,800 $1,478 13.5% 15.5%
49 52 Infiniti Q50 542 1.5% $1,160,993 $241 13 2.4% $163,059 $16,540 $12,543 $1,272 14.0% 15.5%
50 36 Acura ILX 395 1.1% $789,289 $226 12 3.0% $107,800 $10,822 $8,983 $902 13.7% 15.0%
51 20 Mazda Mazda5 370 1.0% $711,986 $221 18 4.9% $97,230 $9,185 $5,402 $510 13.7% 14.9%
52 72 Jaguar XF 557 1.6% $1,204,701 $240 16 2.9% $162,620 $15,016 $10,164 $939 13.5% 14.7%
53 40 TOYOTA Prius 600 1.7% $1,200,889 $225 22 3.7% $161,280 $13,446 $7,331 $611 13.4% 14.5%
54 32 Ford C-Max Hybrid 395 1.1% $790,342 $226 18 4.6% $98,640 $14,921 $5,480 $829 12.5% 14.4%
55 19 Mazda Mazda6 202 0.6% $404,561 $219 12 5.9% $50,610 $4,780 $4,218 $398 12.5% 13.7%
56 60 Infiniti Q60 503 1.4% $1,068,568 $240 13 2.6% $136,644 $9,426 $10,511 $725 12.8% 13.7%
57 54 Hyundai Azera 541 1.5% $1,127,270 $237 20 3.7% $132,600 $13,841 $6,630 $692 11.8% 13.0%
58 18 Mazda CX-5 200 0.6% $411,170 $232 12 6.0% $48,090 $4,162 $4,008 $347 11.7% 12.7%
59 63 Jaguar F-PACE 534 1.5% $1,142,468 $236 14 2.6% $128,000 $14,596 $9,143 $1,043 11.2% 12.5%
60 57 Kia Kia Sorento 512 1.4% $1,062,525 $234 11 2.1% $116,550 $11,476 $10,595 $1,043 11.0% 12.0%
61 42 Ford Flex 458 1.3% $932,448 $225 11 2.4% $91,060 $9,294 $8,278 $845 9.8% 10.8%
62 34 Dodge Challenger 338 0.9% $696,587 $230 10 3.0% $66,766 $8,015 $6,677 $801 9.6% 10.7%
63 53 BMW 328i 535 1.5% $1,085,450 $232 12 2.2% $100,300 $10,149 $8,358 $846 9.2% 10.2%
64 58 TOYOTA RAV4 521 1.5% $1,080,284 $238 12 2.3% $97,650 $8,302 $8,138 $692 9.0% 9.8%
65 41 BMW 228i 423 1.2% $890,445 $229 14 3.3% $78,090 $8,222 $5,578 $587 8.8% 9.7%
66 68 Ford Expedition 523 1.5% $1,101,031 $236 11 2.1% $95,040 $10,776 $8,640 $980 8.6% 9.6%
67 12 Mitsubishi Lancer 185 0.5% $374,597 $227 8 4.3% $31,500 $3,825 $3,938 $478 8.4% 9.4%
68 56 Acura RDX 551 1.5% $1,145,841 $241 17 3.1% $97,300 $10,157 $5,724 $597 8.5% 9.4%
69 31 Ford Fusion 371 1.0% $723,581 $221 11 3.0% $60,240 $6,079 $5,476 $553 8.3% 9.2%
70 22 Ford Focus SE 172 0.5% $372,149 $239 5 2.9% $31,020 $3,054 $6,204 $611 8.3% 9.2%
71 16 Kia Kia Forte 184 0.5% $380,045 $229 5 2.7% $32,200 $2,090 $6,440 $418 8.5% 9.0%
72 5 Mitsubishi Mirage 189 0.5% $371,725 $225 8 4.2% $26,560 $2,787 $3,320 $348 7.1% 7.9%
73 7 Dodge Dart 191 0.5% $378,327 $226 8 4.2% $26,180 $2,783 $3,272 $348 6.9% 7.7%
74 6 Nissan Sentra 188 0.5% $358,783 $209 11 5.9% $24,960 $2,054 $2,269 $187 7.0% 7.5%
75 27 Honda CRV 188 0.5% $353,865 $222 5 2.7% $23,690 $2,693 $4,738 $539 6.7% 7.5%
76 28 Jeep Wrangler 400 1.1% $803,461 $222 8 2.0% $55,200 $4,064 $6,900 $508 6.9% 7.4%
77 67 Infiniti QX60 552 1.5% $1,184,539 $239 10 1.8% $65,360 $6,781 $6,536 $678 5.5% 6.1%
78 64 Audi A5 Coupe 514 1.4% $1,090,685 $240 9 1.8% $56,700 $7,400 $6,300 $822 5.2% 5.9%
79 49 Chrysler 300 Series 561 1.6% $1,164,961 $233 9 1.6% $59,840 $6,595 $6,649 $733 5.1% 5.7%
80 50 Nissan Maxima 538 1.5% $1,125,439 $238 15 2.8% $55,360 $6,694 $3,691 $446 4.9% 5.5%
81 62 Mercedes-Benz C Class Sedan 525 1.5% $1,064,385 $234 8 1.5% $47,580 $6,965 $5,948 $871 4.5% 5.1%
82 29 Jeep Cherokee 345 1.0% $682,940 $223 10 2.9% $31,740 $2,532 $3,174 $253 4.6% 5.0%
83 11 TOYOTA Yaris 190 0.5% $367,154 $215 6 3.2% $13,090 $1,336 $2,182 $223 3.6% 3.9%
84 66 Audi S3 503 1.4% $1,135,602 $246 5 1.0% $37,840 $4,949 $7,568 $990 3.3% 3.8%
85 69 Audi Q5 575 1.6% $1,199,048 $234 5 0.9% $26,100 $3,879 $5,220 $776 2.2% 2.5%
86 47 Kia Kia Sportage 412 1.2% $840,594 $228 5 1.2% $16,200 $726 $3,240 $145 1.9% 2.0%

Loss Ratio by County

In [65]:
df_loss_ratio_by_county = get_performance_metrics(df_risk_master, 'County')
display_performance_metrics(df_loss_ratio_by_county)
Out[65]:
County Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
19 San Benito County 73 0.2% $162,984 $236 2 2.7% $89,070 $15,644 $44,535 $7,822 54.6% 64.2%
29 Shasta County 191 0.5% $380,337 $227 11 5.8% $144,320 $18,808 $13,120 $1,710 37.9% 42.9%
13 Monterey County 237 0.7% $498,278 $237 7 3.0% $178,660 $16,544 $25,523 $2,363 35.9% 39.2%
20 San Bernardino County 1869 5.2% $3,882,667 $234 76 4.1% $1,010,513 $93,999 $13,296 $1,237 26.0% 28.4%
28 Santa Cruz County 159 0.4% $358,818 $246 4 2.5% $89,450 $10,239 $22,362 $2,560 24.9% 27.8%
33 Sutter County 86 0.2% $179,527 $239 6 7.0% $45,440 $3,202 $7,573 $534 25.3% 27.1%
18 Sacramento County 1491 4.2% $3,103,274 $236 54 3.6% $683,922 $90,534 $12,665 $1,677 22.0% 25.0%
6 Kern County 678 1.9% $1,431,021 $238 23 3.4% $300,675 $34,676 $13,073 $1,508 21.0% 23.4%
26 Santa Barbara County 285 0.8% $593,467 $237 6 2.1% $120,100 $18,664 $20,017 $3,111 20.2% 23.4%
27 Santa Clara County 2098 5.9% $4,380,825 $236 92 4.4% $910,389 $90,112 $9,896 $979 20.8% 22.8%
32 Stanislaus County 556 1.6% $1,160,603 $233 26 4.7% $241,240 $22,994 $9,278 $884 20.8% 22.8%
17 Riverside County 2097 5.9% $4,428,822 $237 86 4.1% $925,450 $79,270 $10,761 $922 20.9% 22.7%
1 Butte County 156 0.4% $327,625 $231 7 4.5% $65,020 $8,182 $9,289 $1,169 19.8% 22.3%
22 San Francisco County 691 1.9% $1,432,335 $237 24 3.5% $287,340 $25,198 $11,972 $1,050 20.1% 21.8%
25 San Mateo County 892 2.5% $1,863,738 $236 41 4.6% $360,185 $35,294 $8,785 $861 19.3% 21.2%
4 Fresno County 803 2.2% $1,729,524 $236 29 3.6% $334,480 $30,370 $11,534 $1,047 19.3% 21.1%
8 Los Angeles County 9842 27.6% $20,449,043 $235 375 3.8% $3,906,400 $382,580 $10,417 $1,020 19.1% 21.0%
21 San Diego County 3216 9.0% $6,725,499 $236 120 3.7% $1,262,130 $126,707 $10,518 $1,056 18.8% 20.7%
23 San Joaquin County 703 2.0% $1,445,655 $237 30 4.3% $277,680 $20,592 $9,256 $686 19.2% 20.6%
24 San Luis Obispo County 162 0.5% $351,015 $247 8 4.9% $63,790 $4,892 $7,974 $611 18.2% 19.6%
30 Solano County 504 1.4% $1,056,641 $235 15 3.0% $188,370 $18,045 $12,558 $1,203 17.8% 19.5%
15 Orange County 3106 8.7% $6,490,124 $236 123 4.0% $1,158,611 $97,789 $9,420 $795 17.9% 19.4%
0 Alameda County 1517 4.2% $3,158,371 $237 54 3.6% $550,310 $47,770 $10,191 $885 17.4% 18.9%
31 Sonoma County 530 1.5% $1,079,165 $235 17 3.2% $174,850 $21,845 $10,285 $1,285 16.2% 18.2%
3 El Dorado County 167 0.5% $350,657 $234 9 5.4% $53,930 $5,819 $5,992 $647 15.4% 17.0%
36 Ventura County 918 2.6% $1,928,197 $234 32 3.5% $289,790 $30,748 $9,056 $961 15.0% 16.6%
16 Placer County 277 0.8% $563,923 $236 7 2.5% $85,191 $8,183 $12,170 $1,169 15.1% 16.6%
2 Contra Costa County 1022 2.9% $2,084,747 $234 31 3.0% $289,405 $22,491 $9,336 $726 13.9% 15.0%
5 Imperial County 134 0.4% $264,064 $232 4 3.0% $34,350 $3,931 $8,588 $983 13.0% 14.5%
34 Tehama County 50 0.1% $107,160 $231 2 4.0% $14,020 $989 $7,010 $494 13.1% 14.0%
10 Marin County 160 0.4% $339,589 $233 7 4.4% $37,680 $4,769 $5,383 $681 11.1% 12.5%
7 Kings County 96 0.3% $199,403 $234 2 2.1% $21,820 $2,826 $10,910 $1,413 10.9% 12.4%
9 Madera County 87 0.2% $187,021 $227 2 2.3% $18,880 $2,899 $9,440 $1,450 10.1% 11.6%
38 Yuba County 67 0.2% $159,597 $247 2 3.0% $11,570 $1,498 $5,785 $749 7.2% 8.2%
11 Mendocino County 45 0.1% $102,117 $232 2 4.4% $7,260 $529 $3,630 $264 7.1% 7.6%
35 Tulare County 338 0.9% $723,427 $241 9 2.7% $44,488 $6,059 $4,943 $673 6.1% 7.0%
12 Merced County 191 0.5% $378,405 $229 3 1.6% $22,240 $2,841 $7,413 $947 5.9% 6.6%
14 Napa County 71 0.2% $146,639 $238 2 2.8% $7,800 $1,010 $3,900 $505 5.3% 6.0%
37 Yolo County 146 0.4% $278,251 $227 4 2.7% $15,380 $889 $3,845 $222 5.5% 5.8%

Loss Ratio by City

Only display top 20 cities with the largest number of policyholders

In [66]:
df_loss_ratio_by_city = get_performance_metrics(df_risk_master, 'City')
display_performance_metrics(df_loss_ratio_by_city.nlargest(20, 'Number of Drivers').sort_values('Combined Ratio', ascending=False))
# display_performance_metrics(df_loss_ratio_by_city)
Out[66]:
City Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
129 Long Beach 591 1.7% $1,238,293 $240 30 5.1% $494,906 $43,386 $16,497 $1,446 40.0% 43.5%
102 Huntington Beach 231 0.6% $497,251 $236 9 3.9% $162,420 $10,377 $18,047 $1,153 32.7% 34.8%
271 Torrance 253 0.7% $494,413 $228 12 4.7% $147,971 $14,818 $12,331 $1,235 29.9% 32.9%
153 Moreno Valley 249 0.7% $513,278 $237 8 3.2% $162,570 $6,039 $20,321 $755 31.7% 32.8%
5 Anaheim 372 1.0% $755,987 $234 25 6.7% $200,690 $17,415 $8,028 $697 26.5% 28.9%
14 Bakersfield 559 1.6% $1,175,551 $237 19 3.4% $269,905 $31,601 $14,206 $1,663 23.0% 25.6%
218 Sacramento 651 1.8% $1,350,018 $236 21 3.2% $280,525 $37,600 $13,358 $1,790 20.8% 23.6%
220 San Bernardino 239 0.7% $483,598 $235 14 5.9% $104,640 $6,800 $7,474 $486 21.6% 23.0%
241 Santa Ana 390 1.1% $796,981 $232 15 3.8% $168,400 $14,844 $11,227 $990 21.1% 23.0%
227 San Francisco 691 1.9% $1,432,335 $237 24 3.5% $287,340 $25,198 $11,972 $1,050 20.1% 21.8%
54 Corona 290 0.8% $637,910 $246 16 5.5% $128,350 $10,492 $8,022 $656 20.1% 21.8%
169 Oakland 388 1.1% $809,819 $238 22 5.7% $154,330 $15,198 $7,015 $691 19.1% 20.9%
229 San Jose 1156 3.2% $2,407,115 $234 52 4.5% $466,352 $33,713 $8,968 $648 19.4% 20.8%
212 Riverside 432 1.2% $893,701 $235 20 4.6% $166,790 $16,596 $8,340 $830 18.7% 20.5%
146 Modesto 310 0.9% $636,783 $236 13 4.2% $106,870 $9,524 $8,221 $733 16.8% 18.3%
224 San Diego 1376 3.9% $2,922,260 $239 54 3.9% $442,956 $50,063 $8,203 $927 15.2% 16.9%
130 Los Angeles 2269 6.4% $4,712,905 $234 79 3.5% $702,277 $72,819 $8,890 $922 14.9% 16.4%
262 Stockton 364 1.0% $768,052 $236 11 3.0% $102,220 $8,741 $9,293 $795 13.3% 14.4%
84 Fresno 550 1.5% $1,175,991 $235 14 2.5% $118,180 $8,348 $8,441 $596 10.0% 10.8%
83 Fremont 226 0.6% $444,704 $229 6 2.7% $36,460 $3,511 $6,077 $585 8.2% 9.0%

Loss Ratio by ZIP

Only display top 20 ZIP codes with the largest number of policyholders

In [67]:
df_loss_ratio_by_ZIP = get_performance_metrics(df_risk_master, 'ZIP')
display_performance_metrics(df_loss_ratio_by_ZIP.nlargest(20, 'Number of Drivers').sort_values('Combined Ratio', ascending=False))
# display_performance_metrics(df_loss_ratio_by_ZIP)
Out[67]:
ZIP Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
174 91786 80 0.2% $166,023 $236 4 5.0% $136,140 $21,958 $34,035 $5,490 82.0% 95.2%
306 92701 81 0.2% $172,778 $243 7 8.6% $94,920 $8,503 $13,560 $1,215 54.9% 59.9%
185 91950 83 0.2% $172,148 $233 4 4.8% $79,110 $9,163 $19,778 $2,291 46.0% 51.3%
278 92570 90 0.3% $198,724 $236 6 6.7% $68,560 $8,456 $11,427 $1,409 34.5% 38.8%
93 90810 84 0.2% $175,348 $246 4 4.8% $51,770 $5,811 $12,942 $1,453 29.5% 32.8%
354 93291 85 0.2% $207,802 $255 4 4.7% $31,343 $4,816 $7,836 $1,204 15.1% 17.4%
404 94025 89 0.2% $163,814 $232 1 1.1% $24,440 $978 $24,440 $978 14.9% 15.5%
320 92821 81 0.2% $199,283 $251 2 2.5% $21,570 $1,664 $10,785 $832 10.8% 11.7%
598 96003 86 0.2% $163,460 $221 3 3.5% $14,000 $2,076 $4,667 $692 8.6% 9.8%
100 91016 82 0.2% $166,075 $226 3 3.7% $14,760 $1,314 $4,920 $438 8.9% 9.7%
273 92553 83 0.2% $162,380 $229 1 1.2% $12,880 $129 $12,880 $129 7.9% 8.0%
428 94303 82 0.2% $178,122 $246 2 2.4% $6,300 $521 $3,150 $260 3.5% 3.8%
439 94521 81 0.2% $152,939 $234 1 1.2% $5,057 $759 $5,057 $759 3.3% 3.8%
325 92843 87 0.2% $186,561 $241 1 1.1% $2,850 $285 $2,850 $285 1.5% 1.7%
70 90630 84 0.2% $189,225 $239 1 1.2% $2,660 $213 $2,660 $213 1.4% 1.5%
451 94546 83 0.2% $171,783 $230 0 0.0% $0 $0 $nan $nan 0.0% 0.0%
373 93535 82 0.2% $167,370 $224 0 0.0% $0 $0 $nan $nan 0.0% 0.0%
351 93257 86 0.2% $169,393 $239 0 0.0% $0 $0 $nan $nan 0.0% 0.0%
393 93722 81 0.2% $181,117 $235 0 0.0% $0 $0 $nan $nan 0.0% 0.0%
128 91362 91 0.3% $204,116 $243 0 0.0% $0 $0 $nan $nan 0.0% 0.0%

Loss Ratio by Vehicle Model Year

Looking at the loss ratios by vehicle age may be a better analysis. However, I've included this since a few groups are doing analysis on the loss ratios by vehicle model year.

In [68]:
df_loss_ratio_by_model_year = get_performance_metrics(df_risk_master, 'Vehicle Model Year')
display_performance_metrics(df_loss_ratio_by_model_year.sort_values('Vehicle Model Year'))
Out[68]:
Vehicle Model Year Number of Drivers Driver % Total Premium Average Monthly Premium Claim Count Claim % Total Claim Amount Total Claim Expense Average Claim Amount Average Claim Expense Loss Ratio Combined Ratio
0 2005 4052 11.3% $10,493,967 $295 184 4.5% $1,559,673 $150,359 $8,476 $817 14.9% 16.3%
1 2006 3888 10.9% $9,595,777 $281 153 3.9% $1,757,850 $176,493 $11,489 $1,154 18.3% 20.2%
2 2007 3958 11.1% $9,430,771 $267 175 4.4% $1,990,719 $187,234 $11,376 $1,070 21.1% 23.1%
3 2008 3894 10.9% $8,601,692 $249 157 4.0% $1,807,154 $153,667 $11,511 $979 21.0% 22.8%
4 2009 4044 11.3% $8,337,996 $235 156 3.9% $1,619,952 $173,474 $10,384 $1,112 19.4% 21.5%
5 2010 3982 11.2% $7,809,302 $221 140 3.5% $1,428,054 $160,594 $10,200 $1,147 18.3% 20.3%
6 2011 4019 11.3% $7,431,155 $207 125 3.1% $1,421,926 $138,411 $11,375 $1,107 19.1% 21.0%
7 2012 3984 11.2% $6,715,199 $190 119 3.0% $1,357,977 $129,874 $11,412 $1,091 20.2% 22.2%
8 2013 3891 10.9% $6,068,021 $175 145 3.7% $1,378,894 $139,326 $9,510 $961 22.7% 25.0%

Claim Analysis

Which claim branch is processing the largest number of claims?

In [69]:
df_claims_by_branch = df_claims_master.groupby(['Branch ID', 'Branch Name']).agg({
    'Claim Amount': 'sum', 
    'Claim ID': 'count', 
    'Claim Expense': 'sum'
}).reset_index().rename(columns={ 'Claim ID': 'Claim Count' })

df_claims_by_branch['Branch'] = df_claims_by_branch['Branch Name'].str.replace(r' CA Branch', '')
df_claims_by_branch.drop(columns=['Branch Name'], inplace=True)
df_claims_by_branch['Average Claim Amount'] = df_claims_by_branch['Claim Amount']  / df_claims_by_branch['Claim Count']
df_claims_by_branch['Average Claim Expense'] = df_claims_by_branch['Claim Expense']  / df_claims_by_branch['Claim Count']
df_claims_by_branch['Claim Expense Ratio'] = df_claims_by_branch['Claim Expense']  / df_claims_by_branch['Claim Amount']

df_claims_by_branch
Out[69]:
Branch ID Claim Amount Claim Count Claim Expense Branch Average Claim Amount Average Claim Expense Claim Expense Ratio
0 1 2618884 232 197667.62 San Deigo 11288.293103 852.015603 0.075478
1 2 2458995 252 179419.03 Watsonville 9757.916667 711.980278 0.072964
2 3 6372146 613 817075.68 Norwak 10395.017945 1332.913018 0.128226
3 4 2872174 257 215267.87 Sylmar 11175.774319 837.618171 0.074949
In [70]:
fig = make_subplots(1, 2, specs=[[{'type': 'domain'}, {'type': 'domain'}]], subplot_titles=['By Number of Claims', 'By Total Claim Amount'])

fig.add_trace(
    go.Pie(
        labels=df_claims_by_branch['Branch'],
        values=df_claims_by_branch['Claim Count'],
        text=df_claims_by_branch['Branch'],
        textposition='inside',
        insidetextorientation='horizontal',
        pull=[0.0, 0, 0.05, 0],
        sort=False,
        hole=0.3
    ), 1, 1
)

fig.add_trace(
    go.Pie(
        labels=df_claims_by_branch['Branch'],
        values=df_claims_by_branch['Claim Amount'],
        text=df_claims_by_branch['Branch'],
        textposition='inside',
        insidetextorientation='horizontal',
        pull=[0.0, 0, 0.05, 0],
        sort=False,
        hole=0.3,
    ), 1, 2
)

fig.update_layout(title_text='Claims Breakdown by Count & Amount')
fig.show()
Conclusion From looking at the breakdowns above, Norwak branch processes the largest number of claims both by frequency and amount.

Which claim branch has the highest claim expense ratio?

Claim expense ratio is defined as Claim Expense / Claim Amount.

In [71]:
fig = px.bar(
    df_claims_by_branch,
    x='Branch',
    y='Claim Expense Ratio',
    title='Claim Expense Ratio by Claim Branch',
    text=(df_claims_by_branch['Claim Expense Ratio'] * 100).round(2).astype(str) + '%',
    color='Branch',
    labels={
        'Branch': 'Claim Branch'
    },
    range_y=[0, 0.2]
)

fig.update_layout(
    yaxis=dict(
        gridcolor='#f5f5f5',
        gridwidth=1,
        zerolinecolor='#f5f5f5',
        zerolinewidth=1,
    ),
    yaxis_tickformat='%',
    paper_bgcolor='white',
    plot_bgcolor='white',
)

fig.update_traces(
    textposition='outside'
)

fig.update_layout(
    autosize=False,
    width=700
)

fig.show()
Conclusion Norwak branch has an unusually high claim expense ratio of 12.82%. This incurs a large hit on the loss of our client since Norwak is the largest claim branch that processes about 45% of the total claims.

What are the types of claims and how frequent are they?

In [72]:
df_claims_by_description = df_claims_master.groupby('Claim Description').agg({
        'Claim ID': 'count', 
        'Claim Amount': 'sum', 
        'Claim Expense': 'sum'
    }) \
    .rename(columns={
        'Claim ID': 'Count'
    }) \
    .reset_index()

df_claims_by_description.loc[1, 'Claim Description'] = 'Damge from impacts with animals'
df_claims_by_description['Average Claim Amount'] = df_claims_by_description['Claim Amount'] / df_claims_by_description['Count']
df_claims_by_description['Average Claim Expense'] = df_claims_by_description['Claim Expense'] / df_claims_by_description['Count']
df_claims_by_description['Claim Expense Ratio'] = df_claims_by_description['Claim Expense'] / df_claims_by_description['Claim Amount']

df_claims_by_description
Out[72]:
Claim Description Count Claim Amount Claim Expense Average Claim Amount Average Claim Expense Claim Expense Ratio
0 Bodily Injury Claims 254 1276721 124949.77 5026.460630 491.928228 0.097868
1 Damge from impacts with animals 264 1345674 129431.97 5097.250000 490.272614 0.096184
2 Injury by Uninsured or Underinsured Defendant 138 2377398 208421.93 17227.521739 1510.303841 0.087668
3 Loss of Value to Vehicle 110 4205567 433787.65 38232.427273 3943.524091 0.103146
4 Personal Injury 133 2236326 230463.42 16814.481203 1732.807669 0.103054
5 Property Damage Claims 88 742408 74777.92 8436.454545 849.749091 0.100723
6 Reimbursement for Property Repairs 99 859358 78658.46 8680.383838 794.529899 0.091532
7 Third Party Car Insurance Claims 268 1278747 128939.08 4771.444030 481.115970 0.100832
In [73]:
# Claim types by count
fig_types_by_count = px.pie(
    df_claims_by_description,
    values='Count',
    names='Claim Description',
    title='Claim Types by Count',
    hole=0.4,
    color_discrete_sequence=px.colors.sequential.Blues
)

fig_types_by_count.update_traces(textposition='outside', textinfo='percent+label')
fig_types_by_count.update_layout(showlegend=False)

fig_types_by_count.show()



# Claim types by amount
fig_types_by_amount = px.pie(
    df_claims_by_description,
    values='Claim Amount',
    names='Claim Description',
    title='Claim Types by Amount',
    hole=0.4,
    color_discrete_sequence=px.colors.sequential.Blues
)

fig_types_by_amount.update_traces(textposition='outside', textinfo='percent+label')
fig_types_by_amount.update_layout(showlegend=False)

fig_types_by_amount.show()

Sales Analysis

I define a "sale" as a new driver signing up with one of the agencies. The driver can either start a new policy or join an existing policy (i.e. join a policy of another family member).

Sales Trend by Month

In [74]:
df_policy_activation = df_risk_master[['Policy Effective Date', 'Agency ID']].copy()
df_policy_activation['YearMonth'] = df_policy_activation['Policy Effective Date'].dt.strftime('%Y-%m')

df_policy_activation.head(3)
Out[74]:
Policy Effective Date Agency ID YearMonth
0 2014-03-09 11 2014-03
1 2011-04-02 15 2011-04
2 2012-01-29 3 2012-01

First, count the number of new policies

In [75]:
df_num_policies_activated_by_date = df_policy_activation \
    .groupby('YearMonth') \
    .size() \
    .to_frame() \
    .reset_index() \
    .rename(columns={
        'Policy Effective Date': 'Date',
        0: 'Activation'
    })

df_num_policies_activated_by_date.head(3)
Out[75]:
YearMonth Activation
0 2010-07 50
1 2010-08 126
2 2010-09 221
In [76]:
fig = px.line(
    df_num_policies_activated_by_date,
    x='YearMonth',
    y='Activation',
    labels={
        'YearMonth': 'Month'
    },
    title='Number of New Drivers Who Activated a Policy by Year-Month'
)
fig.show()
Conclusion The number of new policies (by driver) have been on a steady decline since May 2013.

Save Merged dataframes to pickle files

In [77]:
df_risk_master.to_pickle('df_risk_master.pkl')
df_claims_master.to_pickle('df_claims_master.pkl')